My two cents on the topic: 

The URLFilter family of plugins is handled by the URLFilters class, which 
gets instantiated in several places in the source code, including the Fetcher 
and the Injector. The URLFilters class uses the PluginRepository.get() method 
to load the plugins, and this method does use a cache keyed on the UUID of the 
NutchConfiguration object passed as an argument; the generated UUID can be 
found inside the config object under the "nutch.conf.uuid" key. From what I can 
see in the NutchConfiguration class, each time the create() method is called a 
new instance of the Configuration class is created and a new UUID is 
generated. The new UUID causes a cache miss, so a new PluginRepository is 
created and cached. 
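To illustrate the behavior described above, here is a minimal, self-contained Java sketch of a UUID-keyed repository cache. The `Config` and `RepoCacheSketch` names are illustrative stand-ins, not Nutch's actual classes; the point is only that a cache keyed by a per-config UUID misses for every freshly created configuration object:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class RepoCacheSketch {

    // stand-in for a Configuration; mimics the "nutch.conf.uuid"
    // property being generated at creation time
    static class Config {
        final String uuid = UUID.randomUUID().toString();
    }

    // stand-in for PluginRepository.CACHE, keyed by the config's UUID
    static final Map<String, Object> CACHE = new HashMap<>();

    // returns true on a cache hit, false on a miss (a new
    // "repository" is created and cached on a miss)
    static boolean get(Config conf) {
        if (CACHE.containsKey(conf.uuid)) {
            return true;
        }
        CACHE.put(conf.uuid, new Object());
        return false;
    }

    public static void main(String[] args) {
        Config a = new Config();
        Config b = new Config(); // a second create() call -> new UUID

        System.out.println(get(a)); // false: first lookup misses
        System.out.println(get(a)); // true: same config object hits
        System.out.println(get(b)); // false: new config, new UUID, miss
    }
}
```

So reusing the same configuration object hits the cache, while every call that constructs a fresh configuration gets a fresh repository.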

----- Original Message -----

From: "Renxia Wang" <[email protected]> 
To: [email protected] 
Sent: Monday, February 23, 2015 1:00:30 AM 
Subject: [MASSMAIL]Re: How to read metadata/content of an URL in URLFilter? 

I logged the instance ID and got these results: 

2015-02-22 21:42:15,972 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
423250256 
2015-02-22 21:42:24,782 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
828433560 
2015-02-22 21:42:24,795 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
828433560 
2015-02-22 21:42:24,804 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
828433560 
... 
2015-02-22 21:42:25,039 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
828433560 
2015-02-22 21:42:25,041 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
828433560 
2015-02-22 21:42:28,282 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
1240209240 
2015-02-22 21:42:28,292 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
1240209240 
... 
2015-02-22 21:42:28,487 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
1240209240 
2015-02-22 21:42:28,489 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
1240209240 
2015-02-22 21:42:43,984 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
1818924295 
2015-02-22 21:42:44,090 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
1818924295 
... 
2015-02-22 21:42:53,404 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
1818924295 
2015-02-22 21:44:08,533 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
908006650 
2015-02-22 21:44:08,544 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
908006650 
... 
2015-02-22 21:44:10,418 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
908006650 
2015-02-22 21:44:10,420 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
908006650 
2015-02-22 21:44:14,467 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
619451848 
2015-02-22 21:44:14,478 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
619451848 
... 
2015-02-22 21:44:15,643 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
619451848 
2015-02-22 21:44:15,644 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
619451848 
2015-02-22 21:44:26,189 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
1343455839 
2015-02-22 21:44:28,501 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
1343455839 
... 
2015-02-22 21:45:29,707 INFO exactdup.ExactDupURLFilter - URlFilter ID: 
1343455839 

As the URL filters are called in the injector and the crawldb update, I grep: 
➜ local git:(trunk) ✗ grep 'Injector: starting at' logs/hadoop.log 
2015-02-22 21:42:14,896 INFO crawl.Injector - Injector: starting at 2015-02-22 
21:42:14 

This means the URlFilter ID 423250256 is the one created in the injector. 

➜ local git:(trunk) ✗ grep 'CrawlDb update: starting at' logs/hadoop.log 
2015-02-22 21:42:25,951 INFO crawl.CrawlDb - CrawlDb update: starting at 
2015-02-22 21:42:25 
2015-02-22 21:44:11,208 INFO crawl.CrawlDb - CrawlDb update: starting at 
2015-02-22 21:44:11 

Here is the confusing part: there are six unique URL filter IDs after the 
injector, while there are only two crawldb updates. 

On Sun, Feb 22, 2015 at 9:24 PM, Mattmann, Chris A (3980) < 
[email protected] > wrote: 


Cool, good test. I thought the Nutch plugin system cached instances 
of plugins - I am not sure if it creates a new one each time. Are you 
sure you don’t have the same URLFilter instance that’s just called on 
different datasets and thus produces different counts? 

Either way, you should simply proceed with the filters in whatever 
form they are working in (cached or not). 
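The IDs in the log above look like JVM identity hashes, which is one way to check whether two calls see the same instance. A minimal sketch (the poster's actual logging code is not shown, so this is an assumption about how the IDs were produced):

```java
// Distinguishing object instances by their JVM identity hash: the same
// object always yields the same value, while two distinct live objects
// almost always yield different values.
public class InstanceIdDemo {

    // the kind of value a filter might log, e.g. "URlFilter ID: " + instanceId(this)
    static int instanceId(Object o) {
        return System.identityHashCode(o);
    }

    public static void main(String[] args) {
        InstanceIdDemo a = new InstanceIdDemo();
        InstanceIdDemo b = new InstanceIdDemo();
        System.out.println(instanceId(a) == instanceId(a)); // same object: equal
        System.out.println(instanceId(a) == instanceId(b)); // distinct objects: unequal (barring rare collisions)
    }
}
```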

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 
Chris Mattmann, Ph.D. 
Chief Architect 
Instrument Software and Science Data Systems Section (398) 
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA 
Office: 168-519, Mailstop: 168-527 
Email: [email protected] 
WWW: http://sunset.usc.edu/~mattmann/ 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 
Adjunct Associate Professor, Computer Science Department 
University of Southern California, Los Angeles, CA 90089 USA 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 






-----Original Message----- 
From: Renxia Wang < [email protected] > 
Reply-To: " [email protected] " < [email protected] > 
Date: Sunday, February 22, 2015 at 9:16 PM 
To: " [email protected] " < [email protected] > 
Subject: Re: How to read metadata/content of an URL in URLFilter? 

>I just added a counter in my URLFilter, and proved that the URLFilter 
>instances in each fetching cycle are different. 
> 
> 
>Sample logs: 
>2015-02-22 21:07:10,636 INFO exactdup.ExactDupURLFilter - Processed 69 
>links 
>2015-02-22 21:07:10,638 INFO exactdup.ExactDupURLFilter - Processed 70 
>links 
>2015-02-22 21:07:10,640 INFO exactdup.ExactDupURLFilter - Processed 71 
>links 
>2015-02-22 21:07:10,641 INFO exactdup.ExactDupURLFilter - Processed 72 
>links 
>2015-02-22 21:07:10,643 INFO exactdup.ExactDupURLFilter - Processed 73 
>links 
>2015-02-22 21:07:10,645 INFO exactdup.ExactDupURLFilter - Processed 74 
>links 
>2015-02-22 21:07:10,647 INFO exactdup.ExactDupURLFilter - Processed 75 
>links 
>2015-02-22 21:07:10,649 INFO exactdup.ExactDupURLFilter - Processed 76 
>links 
>2015-02-22 21:07:10,650 INFO exactdup.ExactDupURLFilter - Processed 77 
>links 
>2015-02-22 21:07:13,835 INFO exactdup.ExactDupURLFilter - Processed 1 
>links 
>2015-02-22 21:07:13,850 INFO exactdup.ExactDupURLFilter - Processed 2 
>links 
>2015-02-22 21:07:13,865 INFO exactdup.ExactDupURLFilter - Processed 3 
>links 
>2015-02-22 21:07:13,878 INFO exactdup.ExactDupURLFilter - Processed 4 
>links 
>2015-02-22 21:07:13,889 INFO exactdup.ExactDupURLFilter - Processed 5 
>links 
>2015-02-22 21:07:13,899 INFO exactdup.ExactDupURLFilter - Processed 6 
>links 
> 
> 
> 
>Not sure if it is configurable? 
> 
> 
> 
> 
>On Sun, Feb 22, 2015 at 8:56 PM, Mattmann, Chris A (3980) 
>< [email protected] > wrote: 
> 
>That’s one way - for sure - but what I was implying is that 
>you can train (read: feed data into) your model (read: algorithm) 
>using previously crawled information. So, no, I wasn’t implying 
>machine learning. 
> 
> 
> 
> 
> 
> 
> 
>-----Original Message----- 
>From: Renxia Wang < [email protected] > 
>Reply-To: " [email protected] " < [email protected] > 
>Date: Sunday, February 22, 2015 at 8:47 PM 
>To: " [email protected] " < [email protected] > 
>Subject: Re: How to read metadata/content of an URL in URLFilter? 
> 
>>Hi Prof Mattmann, 
>> 
>> 
>>You say "train" and "model" - are we expected to use machine 
>>learning algorithms to train a model for duplicate detection? 
>> 
>> 
>>Thanks, 
>> 
>> 
>>Renxia 
>> 
>> 
>>On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980) 
>>< [email protected] > wrote: 
>> 
>>There is nothing in your assignment stating that you can’t 
>>use *previously* crawled data to train your model - you 
>>should have at least 2 full sets of this. 
>> 
>>Cheers, 
>>Chris 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>>-----Original Message----- 
>>From: Majisha Parambath < [email protected] > 
>>Reply-To: " [email protected] " < [email protected] > 
>>Date: Sunday, February 22, 2015 at 8:30 PM 
>>To: dev < [email protected] > 
>>Subject: Re: How to read metadata/content of an URL in URLFilter? 
>> 
>>> 
>>> 
>>> 
>>>My understanding is that the LinkDB or CrawlDB will contain the results 
>>>of previously fetched and parsed pages. 
>>> 
>>>However, if we want to get the contents of a URL/page in the URL 
>>>filtering stage (one that is not yet fetched), is there any utility in 
>>>Nutch that we can use to fetch the contents of the page? 
>>> 
>>> 
>>>Thanks and regards, 
>>>Majisha Namath Parambath 
>>>Graduate Student, M.S in Computer Science 
>>>Viterbi School of Engineering 
>>>University of Southern California, Los Angeles 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980) 
>>>< [email protected] > wrote: 
>>> 
>>>In the constructor of your URLFilter, why not consider passing 
>>>in a NutchConfiguration object and then reading the path to, e.g., 
>>>the LinkDb from the config? Then have a private member variable 
>>>for the LinkDbReader (maybe statically initialized for efficiency) 
>>>and use that in your interface method. 
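The pattern suggested above can be sketched with stand-in types. Here `Reader`, the `Map`-based config, and the `linkdb.path` key are hypothetical placeholders, not Nutch's real `LinkDbReader` API; the point is the lazily initialized, shared reader built from a configured path:

```java
import java.util.Map;

public class DedupFilterSketch {

    // placeholder for something like a LinkDbReader
    static class Reader {
        final String path;
        Reader(String path) { this.path = path; }
    }

    // shared across all filter instances; volatile for double-checked locking
    private static volatile Reader reader;

    private final Map<String, String> conf;

    public DedupFilterSketch(Map<String, String> conf) {
        this.conf = conf;
    }

    // lazily create the single shared reader from the configured path
    private Reader reader() {
        if (reader == null) {
            synchronized (DedupFilterSketch.class) {
                if (reader == null) {
                    reader = new Reader(
                        conf.getOrDefault("linkdb.path", "crawl/linkdb"));
                }
            }
        }
        return reader;
    }

    // the interface method: consult previously crawled data via the reader,
    // return the URL to accept it or null to reject (this sketch accepts all)
    public String filter(String url) {
        Reader r = reader();
        return url;
    }
}
```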
>>> 
>>>Cheers, 
>>>Chris 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>-----Original Message----- 
>>>From: Renxia Wang < [email protected] > 
>>>Reply-To: " [email protected] " < [email protected] > 
>>>Date: Sunday, February 22, 2015 at 3:36 PM 
>>>To: " [email protected] " < [email protected] > 
>>>Subject: How to read metadata/content of an URL in URLFilter? 
>>> 
>>>> 
>>>> 
>>>> 
>>>>Hi 
>>>> 
>>>> 
>>>>I want to develop a URLFilter which takes a URL, takes its metadata or 
>>>>even the fetched content, then uses some duplicate detection algorithms 
>>>>to determine if it is a duplicate of any URL in the batch. However, the 
>>>>only parameter passed into the URLFilter is the URL. Is it possible to 
>>>>get the data I want for that input URL in the URLFilter? 
>>>> 
>>>> 
>>>>Thanks, 
>>>> 
>>>> 
>>>>Zhique 