I log the instance ID and get the result:

2015-02-22 21:42:15,972 INFO exactdup.ExactDupURLFilter - URlFilter ID: 423250256
2015-02-22 21:42:24,782 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
2015-02-22 21:42:24,795 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
2015-02-22 21:42:24,804 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
...
2015-02-22 21:42:25,039 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
2015-02-22 21:42:25,041 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
2015-02-22 21:42:28,282 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
2015-02-22 21:42:28,292 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
...
2015-02-22 21:42:28,487 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
2015-02-22 21:42:28,489 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
2015-02-22 21:42:43,984 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1818924295
2015-02-22 21:42:44,090 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1818924295
...
2015-02-22 21:42:53,404 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1818924295
2015-02-22 21:44:08,533 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
2015-02-22 21:44:08,544 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
...
2015-02-22 21:44:10,418 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
2015-02-22 21:44:10,420 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
2015-02-22 21:44:14,467 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
2015-02-22 21:44:14,478 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
...
2015-02-22 21:44:15,643 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
2015-02-22 21:44:15,644 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
2015-02-22 21:44:26,189 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1343455839
2015-02-22 21:44:28,501 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1343455839
...
2015-02-22 21:45:29,707 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1343455839
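For reference, the ID in the log can be produced along these lines. This is a minimal, self-contained sketch, not the actual plugin source; `System.identityHashCode` returns a stable per-object value, so distinct IDs in the log imply distinct filter instances:

```java
public class InstanceIdDemo {

    /** Stand-in for the real plugin class; logs a per-instance ID. */
    static class ExactDupURLFilter {
        ExactDupURLFilter() {
            // The real plugin would go through its logging framework;
            // System.out stands in for log.info here.
            System.out.println("URlFilter ID: " + System.identityHashCode(this));
        }

        String filter(String url) {
            return url; // accept everything; dedup logic omitted
        }
    }

    public static void main(String[] args) {
        ExactDupURLFilter a = new ExactDupURLFilter();
        ExactDupURLFilter b = new ExactDupURLFilter();
        // Two constructions give two (almost certainly distinct) IDs,
        // matching the pattern of distinct IDs in the log above.
        System.out.println(System.identityHashCode(a) != System.identityHashCode(b));
    }
}
```

Note that `identityHashCode` is not guaranteed unique, but collisions are rare enough that repeated distinct IDs are solid evidence of separate instances.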
As the URL filters are called in the injector and the CrawlDb update, I grep:

➜ local git:(trunk) ✗ grep 'Injector: starting at' logs/hadoop.log
2015-02-22 21:42:14,896 INFO crawl.Injector - Injector: starting at *2015-02-22 21:42:14*

which means URlFilter ID: 423250256 is the one created in the injector.

➜ local git:(trunk) ✗ grep 'CrawlDb update: starting at' logs/hadoop.log
2015-02-22 21:42:25,951 INFO crawl.CrawlDb - CrawlDb update: starting at 2015-02-22 21:42:25
2015-02-22 21:44:11,208 INFO crawl.CrawlDb - CrawlDb update: starting at 2015-02-22 21:44:11

Here is what confuses me: there are 6 unique URLFilter IDs after the injector, while there are only two CrawlDb updates.

On Sun, Feb 22, 2015 at 9:24 PM, Mattmann, Chris A (3980) <[email protected]> wrote:

> Cool, good test. I thought the Nutch plugin system cached instances
> of plugins - I am not sure if it creates a new one each time. Are you
> sure you don’t have the same URLFilter instance, and it’s just called on
> different datasets and thus produces different counts?
>
> Either way, you should simply proceed with the filters in whatever
> form they are working in (cached or not).
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Renxia Wang <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Sunday, February 22, 2015 at 9:16 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: How to read metadata/content of an URL in URLFilter?
>
>I just added a counter in my URLFilter, and prove that the URLFilter
>instances in each fetching circle are different.
>
>Sample logs:
>2015-02-22 21:07:10,636 INFO exactdup.ExactDupURLFilter - Processed 69 links
>2015-02-22 21:07:10,638 INFO exactdup.ExactDupURLFilter - Processed 70 links
>2015-02-22 21:07:10,640 INFO exactdup.ExactDupURLFilter - Processed 71 links
>2015-02-22 21:07:10,641 INFO exactdup.ExactDupURLFilter - Processed 72 links
>2015-02-22 21:07:10,643 INFO exactdup.ExactDupURLFilter - Processed 73 links
>2015-02-22 21:07:10,645 INFO exactdup.ExactDupURLFilter - Processed 74 links
>2015-02-22 21:07:10,647 INFO exactdup.ExactDupURLFilter - Processed 75 links
>2015-02-22 21:07:10,649 INFO exactdup.ExactDupURLFilter - Processed 76 links
>2015-02-22 21:07:10,650 INFO exactdup.ExactDupURLFilter - Processed 77 links
>2015-02-22 21:07:13,835 INFO exactdup.ExactDupURLFilter - Processed 1 links
>2015-02-22 21:07:13,850 INFO exactdup.ExactDupURLFilter - Processed 2 links
>2015-02-22 21:07:13,865 INFO exactdup.ExactDupURLFilter - Processed 3 links
>2015-02-22 21:07:13,878 INFO exactdup.ExactDupURLFilter - Processed 4 links
>2015-02-22 21:07:13,889 INFO exactdup.ExactDupURLFilter - Processed 5 links
>2015-02-22 21:07:13,899 INFO exactdup.ExactDupURLFilter - Processed 6 links
>
>Not sure if it is configurable?
>
>On Sun, Feb 22, 2015 at 8:56 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>That’s one way - for sure - but what I was implying is that
>you can train (read: feed data into) your model (read: algorithm)
>using previously crawled information. So, no I wasn’t implying
>machine learning.
>
>-----Original Message-----
>From: Renxia Wang <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Sunday, February 22, 2015 at 8:47 PM
>To: "[email protected]" <[email protected]>
>Subject: Re: How to read metadata/content of an URL in URLFilter?
>
>>Hi Prof Mattmann,
>>
>>You are saying "train" and "model", are we expected to use machine
>>learning algorithms to train a model for duplication detection?
>>
>>Thanks,
>>
>>Renxia
>>
>>On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
>><[email protected]> wrote:
>>
>>There is nothing stating in your assignment that you can’t
>>use *previously* crawled data to train your model - you
>>should have at least 2 full sets of this.
>>
>>Cheers,
>>Chris
>>
>>-----Original Message-----
>>From: Majisha Parambath <[email protected]>
>>Reply-To: "[email protected]" <[email protected]>
>>Date: Sunday, February 22, 2015 at 8:30 PM
>>To: dev <[email protected]>
>>Subject: Re: How to read metadata/content of an URL in URLFilter?
>>
>>>My understanding is that the LinkDB or CrawlDB will contain the results
>>>of previously fetched and parsed pages.
>>>
>>>However, if we want to get the contents of a URL/page in the URL filtering
>>>stage (which is not yet fetched), is there any util in Nutch that we can use
>>>to fetch the contents of the page?
>>>
>>>Thanks and regards,
>>>Majisha Namath Parambath
>>>Graduate Student, M.S in Computer Science
>>>Viterbi School of Engineering
>>>University of Southern California, Los Angeles
>>>
>>>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
>>><[email protected]> wrote:
>>>
>>>In the constructor of your URLFilter, why not consider passing
>>>in a NutchConfiguration object, and then reading the path to e.g.,
>>>the LinkDb from the config. Then have a private member variable
>>>for the LinkDbReader (maybe static initialized for efficiency)
>>>and use that in your interface method.
>>>
>>>Cheers,
>>>Chris
>>>
>>>-----Original Message-----
>>>From: Renxia Wang <[email protected]>
>>>Reply-To: "[email protected]" <[email protected]>
>>>Date: Sunday, February 22, 2015 at 3:36 PM
>>>To: "[email protected]" <[email protected]>
>>>Subject: How to read metadata/content of an URL in URLFilter?
>>>
>>>>Hi,
>>>>
>>>>I want to develop a URLFilter which takes a URL plus its metadata, or
>>>>even the fetched content, then uses some duplicate detection algorithms
>>>>to determine if it is a duplicate of any URL in the batch. However, the
>>>>only parameter passed into the URLFilter is the URL. Is it possible to
>>>>get the data I want for that input URL in the URLFilter?
>>>>
>>>>Thanks,
>>>>
>>>>Zhique
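The constructor pattern Chris suggests above (read a path from the configuration, cache one statically initialized reader across filter instances) can be sketched in plain Java. Everything here is a hypothetical stand-in: `LinkDbReaderStub`, `DedupFilter`, and the `linkdb.path` key are illustrations only, not Nutch's actual API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class CachedReaderFilterDemo {

    /** Stand-in for a LinkDb reader; counts how many times it is opened. */
    static class LinkDbReaderStub {
        static final AtomicInteger OPEN_COUNT = new AtomicInteger();
        final String path;

        LinkDbReaderStub(String path) {
            this.path = path;
            OPEN_COUNT.incrementAndGet(); // "opening" the db
        }

        boolean contains(String url) {
            return false; // a real reader would consult the LinkDb
        }
    }

    /** Stand-in for a URLFilter: reads the db path from config once. */
    static class DedupFilter {
        // Static member: initialized on first use and shared by every
        // instance, so repeated plugin instantiation stays cheap even
        // if the plugin system creates a new filter per job phase.
        private static volatile LinkDbReaderStub reader;

        DedupFilter(Map<String, String> conf) {
            if (reader == null) {
                synchronized (DedupFilter.class) {
                    if (reader == null) {
                        reader = new LinkDbReaderStub(conf.get("linkdb.path"));
                    }
                }
            }
        }

        String filter(String url) {
            return reader.contains(url) ? null : url; // null = reject
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("linkdb.path", "crawl/linkdb");
        new DedupFilter(conf);
        new DedupFilter(conf);
        // Two filter instances, but the reader was opened only once.
        System.out.println("opens: " + LinkDbReaderStub.OPEN_COUNT.get());
    }
}
```

The double-checked locking with a `volatile` field is one safe way to do the "static initialized for efficiency" part when filters may be constructed from multiple threads; a static initializer block would also work if the configuration is available at class-load time.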

