Thanks, Jorge, for the helpful information. Since multiple URLFilter instances are created during crawling, is there any way to share data among them, such as a hashmap? That would be useful for my purpose, duplicate detection. Or should I use an external in-memory database?
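To make the question concrete, this is the kind of sharing I have in mind. The class below is only a standalone sketch (the Nutch URLFilter interface and plugin wiring are stripped out, and the class name is made up); it just models the keep-or-null contract of filter() against a JVM-wide set. One caveat I already see: a static field is only shared within one JVM, so in a distributed job each map task would still have its own copy, which is why I am also asking about an external in-memory database.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: every filter instance consults one JVM-wide set of seen URLs.
// Caveat: a static field is shared only within a single JVM, so in a
// distributed Hadoop job each task still has its own copy; cross-task
// sharing would need an external store instead.
public class ExactDupFilterSketch {
    private static final Set<String> SEEN = ConcurrentHashMap.newKeySet();

    // Mimics URLFilter.filter(String): return the url to keep it,
    // or null to drop it as a duplicate.
    public String filter(String url) {
        return SEEN.add(url) ? url : null; // add() is false if already present
    }

    public static void main(String[] args) {
        ExactDupFilterSketch a = new ExactDupFilterSketch();
        ExactDupFilterSketch b = new ExactDupFilterSketch(); // a second instance
        System.out.println(a.filter("http://example.com/")); // kept: first sighting
        System.out.println(b.filter("http://example.com/")); // dropped: set is shared
    }
}
```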
I have also failed to get the path of the linkdb/segments/crawldb. Here is what I did: I implemented the setConf and getConf methods in the URLFilter, then passed the conf to LinkDbReader/CrawlDbReader/SegmentReader along with a path. There are many properties in the conf, like mapred.input.dir, but it sometimes points to a specific version of the linkdb and sometimes to the segments. I also tried to just hard-code the path, but that throws a NullPointerException. I am particularly interested in reading the parse_data in the segments, since I need the parsed metadata. Any thoughts on getting this to work?

Thanks,
Zhique

On Mon, Feb 23, 2015 at 12:56 PM, Jorge Luis Betancourt González <[email protected]> wrote:

> My two cents on the topic:
>
> The URLFilter family of plugins is handled by the URLFilters class. This
> class gets instantiated in several places in the source code, including
> the Fetcher and the Injector. The URLFilters class uses the
> PluginRepository.get() method to load the plugins; this method does
> indeed use a cache based on the UUID of the NutchConfiguration object
> passed as an argument. This generated UUID can be found inside the config
> object under the "nutch.conf.uuid" key. From what I can see in the
> NutchConfiguration class, each time the create() method is called a new
> instance of the Configuration class is created and a new UUID generated;
> the new UUID causes a cache miss, so a new PluginRepository is created
> and cached.
>
> ------------------------------
> *From: *"Renxia Wang" <[email protected]>
> *To: *[email protected]
> *Sent: *Monday, February 23, 2015 1:00:30 AM
> *Subject: *Re: How to read metadata/content of an URL in URLFilter?
>
> I log the instance id and get the result:
>
> 2015-02-22 21:42:15,972 INFO exactdup.ExactDupURLFilter - URlFilter ID: 423250256
> 2015-02-22 21:42:24,782 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
> 2015-02-22 21:42:24,795 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
> 2015-02-22 21:42:24,804 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
> ...
> 2015-02-22 21:42:25,039 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
> 2015-02-22 21:42:25,041 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
> 2015-02-22 21:42:28,282 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
> 2015-02-22 21:42:28,292 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
> ...
> 2015-02-22 21:42:28,487 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
> 2015-02-22 21:42:28,489 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
> 2015-02-22 21:42:43,984 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1818924295
> 2015-02-22 21:42:44,090 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1818924295
> ...
> 2015-02-22 21:42:53,404 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1818924295
> 2015-02-22 21:44:08,533 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
> 2015-02-22 21:44:08,544 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
> ...
> 2015-02-22 21:44:10,418 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
> 2015-02-22 21:44:10,420 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
> 2015-02-22 21:44:14,467 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
> 2015-02-22 21:44:14,478 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
> ...
> 2015-02-22 21:44:15,643 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
> 2015-02-22 21:44:15,644 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
> 2015-02-22 21:44:26,189 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1343455839
> 2015-02-22 21:44:28,501 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1343455839
> ...
> 2015-02-22 21:45:29,707 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1343455839
>
> As the url filters are called in the injector and the crawldb update, I grep:
>
> ➜ local git:(trunk) ✗ grep 'Injector: starting at' logs/hadoop.log
> 2015-02-22 21:42:14,896 INFO crawl.Injector - Injector: starting at 2015-02-22 21:42:14
>
> which means URlFilter ID 423250256 is the one created in the injector.
>
> ➜ local git:(trunk) ✗ grep 'CrawlDb update: starting at' logs/hadoop.log
> 2015-02-22 21:42:25,951 INFO crawl.CrawlDb - CrawlDb update: starting at 2015-02-22 21:42:25
> 2015-02-22 21:44:11,208 INFO crawl.CrawlDb - CrawlDb update: starting at 2015-02-22 21:44:11
>
> Here is where it gets confusing: there are six unique URLFilter ids after the
> injector, while there are only two crawldb updates.
>
> On Sun, Feb 22, 2015 at 9:24 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
>
>> Cool, good test. I thought the Nutch plugin system cached instances
>> of plugins - I am not sure if it creates a new one each time. Are you
>> sure you don't have the same URLFilter instance, and it's just called on
>> different datasets and thus produces different counts?
>>
>> Either way, you should simply proceed with the filters in whatever
>> form they are working in (cached or not).
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>> -----Original Message-----
>> From: Renxia Wang <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Sunday, February 22, 2015 at 9:16 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: How to read metadata/content of an URL in URLFilter?
>>
>> >I just added a counter in my URLFilter, and proved that the URLFilter
>> >instances in each fetching cycle are different.
>> >
>> >Sample logs:
>> >2015-02-22 21:07:10,636 INFO exactdup.ExactDupURLFilter - Processed 69 links
>> >2015-02-22 21:07:10,638 INFO exactdup.ExactDupURLFilter - Processed 70 links
>> >2015-02-22 21:07:10,640 INFO exactdup.ExactDupURLFilter - Processed 71 links
>> >2015-02-22 21:07:10,641 INFO exactdup.ExactDupURLFilter - Processed 72 links
>> >2015-02-22 21:07:10,643 INFO exactdup.ExactDupURLFilter - Processed 73 links
>> >2015-02-22 21:07:10,645 INFO exactdup.ExactDupURLFilter - Processed 74 links
>> >2015-02-22 21:07:10,647 INFO exactdup.ExactDupURLFilter - Processed 75 links
>> >2015-02-22 21:07:10,649 INFO exactdup.ExactDupURLFilter - Processed 76 links
>> >2015-02-22 21:07:10,650 INFO exactdup.ExactDupURLFilter - Processed 77 links
>> >2015-02-22 21:07:13,835 INFO exactdup.ExactDupURLFilter - Processed 1 links
>> >2015-02-22 21:07:13,850 INFO exactdup.ExactDupURLFilter - Processed 2 links
>> >2015-02-22 21:07:13,865 INFO exactdup.ExactDupURLFilter - Processed 3 links
>> >2015-02-22 21:07:13,878 INFO exactdup.ExactDupURLFilter - Processed 4 links
>> >2015-02-22 21:07:13,889 INFO exactdup.ExactDupURLFilter - Processed 5 links
>> >2015-02-22 21:07:13,899 INFO exactdup.ExactDupURLFilter - Processed 6 links
>> >
>> >Not sure if it is configurable?
>> >
>> >On Sun, Feb 22, 2015 at 8:56 PM, Mattmann, Chris A (3980)
>> ><[email protected]> wrote:
>> >
>> >That's one way - for sure - but what I was implying is that
>> >you can train (read: feed data into) your model (read: algorithm)
>> >using previously crawled information. So, no, I wasn't implying
>> >machine learning.
>> >
>> >-----Original Message-----
>> >From: Renxia Wang <[email protected]>
>> >Reply-To: "[email protected]" <[email protected]>
>> >Date: Sunday, February 22, 2015 at 8:47 PM
>> >To: "[email protected]" <[email protected]>
>> >Subject: Re: How to read metadata/content of an URL in URLFilter?
>> >
>> >>Hi Prof Mattmann,
>> >>
>> >>You are saying "train" and "model" - are we expected to use machine
>> >>learning algorithms to train a model for duplication detection?
>> >>
>> >>Thanks,
>> >>
>> >>Renxia
>> >>
>> >>On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
>> >><[email protected]> wrote:
>> >>
>> >>There is nothing in your assignment stating that you can't
>> >>use *previously* crawled data to train your model - you
>> >>should have at least 2 full sets of this.
>> >>
>> >>Cheers,
>> >>Chris
>> >>
>> >>-----Original Message-----
>> >>From: Majisha Parambath <[email protected]>
>> >>Reply-To: "[email protected]" <[email protected]>
>> >>Date: Sunday, February 22, 2015 at 8:30 PM
>> >>To: dev <[email protected]>
>> >>Subject: Re: How to read metadata/content of an URL in URLFilter?
>> >>
>> >>>My understanding is that the LinkDB or CrawlDB will contain the results
>> >>>of previously fetched and parsed pages.
>> >>>
>> >>>However, if we want to get the contents of a URL/page in the URL
>> >>>filtering stage (where it is not yet fetched), is there any util in
>> >>>Nutch that we can use to fetch the contents of the page?
>> >>>
>> >>>Thanks and regards,
>> >>>Majisha Namath Parambath
>> >>>Graduate Student, M.S. in Computer Science
>> >>>Viterbi School of Engineering
>> >>>University of Southern California, Los Angeles
>> >>>
>> >>>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
>> >>><[email protected]> wrote:
>> >>>
>> >>>In the constructor of your URLFilter, why not consider passing
>> >>>in a NutchConfiguration object, and then reading the path to, e.g.,
>> >>>the LinkDb from the config? Then have a private member variable
>> >>>for the LinkDbReader (maybe statically initialized for efficiency)
>> >>>and use that in your interface method.
>> >>>
>> >>>Cheers,
>> >>>Chris
>> >>>
>> >>>-----Original Message-----
>> >>>From: Renxia Wang <[email protected]>
>> >>>Reply-To: "[email protected]" <[email protected]>
>> >>>Date: Sunday, February 22, 2015 at 3:36 PM
>> >>>To: "[email protected]" <[email protected]>
>> >>>Subject: How to read metadata/content of an URL in URLFilter?
>> >>>
>> >>>>Hi,
>> >>>>
>> >>>>I want to develop a URLFilter which takes a url, gets its metadata or
>> >>>>even the fetched content, then uses some duplicate detection algorithms
>> >>>>to determine if it is a duplicate of any url in the batch.
>> >>>>However, the only parameter passed into the URLFilter is the url. Is it
>> >>>>possible to get the data I want for that input url in the URLFilter?
>> >>>>
>> >>>>Thanks,
>> >>>>
>> >>>>Zhique
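P.S. To check my understanding of Jorge's explanation of the plugin caching, I wrote a toy model of it (the class and method names below are stand-ins I made up, not the real Nutch PluginRepository/NutchConfiguration code): because each call to create() stamps a fresh UUID into the new config, a lookup with a newly created config can never hit the cache, so a new repository is built every time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Toy model of the caching behavior described in the thread: the cache is
// keyed on the config's "nutch.conf.uuid" value, and every freshly created
// config carries a brand-new UUID, so it is always a cache miss.
public class PluginCacheModel {
    static final Map<String, PluginCacheModel> CACHE = new HashMap<>();

    // Stand-in for NutchConfiguration.create(): new config, new UUID.
    static String createConf() {
        return UUID.randomUUID().toString();
    }

    // Stand-in for PluginRepository.get(conf): reuse on hit, build on miss.
    static PluginCacheModel get(String confUuid) {
        return CACHE.computeIfAbsent(confUuid, k -> new PluginCacheModel());
    }

    public static void main(String[] args) {
        String conf1 = createConf();
        String conf2 = createConf();                  // fresh UUID
        System.out.println(get(conf1) == get(conf1)); // true: same conf hits the cache
        System.out.println(get(conf1) == get(conf2)); // false: new conf, new repository
    }
}
```

If this model is right, it would explain the distinct URLFilter instance ids per job in my logs above.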

