I just added a counter to my URLFilter, and it shows that the URLFilter instances in each fetch cycle are different.
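A minimal sketch of such a counter, stripped of Nutch dependencies so it runs standalone (the real plugin would implement org.apache.nutch.net.URLFilter and log via SLF4J rather than printing; class and method names here are illustrative):

```java
// Illustrative stand-in for a Nutch URLFilter instrumented with a
// per-instance counter. Because the counter is an instance field,
// a fresh filter instance restarts counting at 1 - which is exactly
// what the log sample below shows between fetch cycles.
public class CountingURLFilter {
    private int processed = 0; // per-instance state

    // Mirrors URLFilter.filter(String): return the URL to accept it,
    // null to reject it. This sketch accepts everything.
    public String filter(String urlString) {
        processed++;
        System.out.println("Processed " + processed + " links");
        return urlString;
    }

    public int getProcessed() {
        return processed;
    }

    public static void main(String[] args) {
        CountingURLFilter first = new CountingURLFilter();
        first.filter("http://example.com/a");
        first.filter("http://example.com/b");
        // A second instance, as created in a new fetch cycle, starts over:
        CountingURLFilter second = new CountingURLFilter();
        second.filter("http://example.com/c");
    }
}
```

If the two objects shared state, the second instance would continue at 3 rather than restart at 1; the counter reset in the logs is therefore evidence of distinct instances.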
Sample logs:

2015-02-22 21:07:10,636 INFO exactdup.ExactDupURLFilter - Processed 69 links
2015-02-22 21:07:10,638 INFO exactdup.ExactDupURLFilter - Processed 70 links
2015-02-22 21:07:10,640 INFO exactdup.ExactDupURLFilter - Processed 71 links
2015-02-22 21:07:10,641 INFO exactdup.ExactDupURLFilter - Processed 72 links
2015-02-22 21:07:10,643 INFO exactdup.ExactDupURLFilter - Processed 73 links
2015-02-22 21:07:10,645 INFO exactdup.ExactDupURLFilter - Processed 74 links
2015-02-22 21:07:10,647 INFO exactdup.ExactDupURLFilter - Processed 75 links
2015-02-22 21:07:10,649 INFO exactdup.ExactDupURLFilter - Processed 76 links
2015-02-22 21:07:10,650 INFO exactdup.ExactDupURLFilter - Processed 77 links
2015-02-22 21:07:13,835 INFO exactdup.ExactDupURLFilter - Processed 1 links
2015-02-22 21:07:13,850 INFO exactdup.ExactDupURLFilter - Processed 2 links
2015-02-22 21:07:13,865 INFO exactdup.ExactDupURLFilter - Processed 3 links
2015-02-22 21:07:13,878 INFO exactdup.ExactDupURLFilter - Processed 4 links
2015-02-22 21:07:13,889 INFO exactdup.ExactDupURLFilter - Processed 5 links
2015-02-22 21:07:13,899 INFO exactdup.ExactDupURLFilter - Processed 6 links

I am not sure whether this behavior is configurable.

On Sun, Feb 22, 2015 at 8:56 PM, Mattmann, Chris A (3980) <[email protected]> wrote:

> That’s one way - for sure - but what I was implying is that
> you can train (read: feed data into) your model (read: algorithm)
> using previously crawled information. So, no, I wasn’t implying
> machine learning.
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Renxia Wang <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Sunday, February 22, 2015 at 8:47 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: How to read metadata/content of an URL in URLFilter?
>
> >Hi Prof Mattmann,
> >
> >You are saying "train" and "model" - are we expected to use machine
> >learning algorithms to train a model for duplicate detection?
> >
> >Thanks,
> >
> >Renxia
> >
> >On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
> ><[email protected]> wrote:
> >
> >There is nothing in your assignment stating that you can’t
> >use *previously* crawled data to train your model - you
> >should have at least 2 full sets of this.
> >
> >Cheers,
> >Chris
> >
> >-----Original Message-----
> >From: Majisha Parambath <[email protected]>
> >Reply-To: "[email protected]" <[email protected]>
> >Date: Sunday, February 22, 2015 at 8:30 PM
> >To: dev <[email protected]>
> >Subject: Re: How to read metadata/content of an URL in URLFilter?
> >
> >>My understanding is that the LinkDB or CrawlDB will contain the results
> >>of previously fetched and parsed pages.
> >>
> >>However, if we want to get the contents of a URL/page in the URL filtering
> >>stage (where the page is not yet fetched), is there any util in Nutch that
> >>we can use to fetch the contents of the page?
> >>
> >>Thanks and regards,
> >>Majisha Namath Parambath
> >>Graduate Student, M.S. in Computer Science
> >>Viterbi School of Engineering
> >>University of Southern California, Los Angeles
> >>
> >>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
> >><[email protected]> wrote:
> >>
> >>In the constructor of your URLFilter, why not consider passing
> >>in a NutchConfiguration object, and then reading the path to, e.g.,
> >>the LinkDb from the config. Then have a private member variable
> >>for the LinkDbReader (maybe statically initialized for efficiency)
> >>and use that in your interface method.
> >>
> >>Cheers,
> >>Chris
> >>
> >>-----Original Message-----
> >>From: Renxia Wang <[email protected]>
> >>Reply-To: "[email protected]" <[email protected]>
> >>Date: Sunday, February 22, 2015 at 3:36 PM
> >>To: "[email protected]" <[email protected]>
> >>Subject: How to read metadata/content of an URL in URLFilter?
> >>
> >>>Hi,
> >>>
> >>>I want to develop a URLFilter which takes a URL, gets its metadata or
> >>>even the fetched content, then uses some duplicate detection algorithms
> >>>to determine whether it is a duplicate of any URL in the batch. However,
> >>>the only parameter passed into the URLFilter is the URL. Is it possible
> >>>to get the data I want for that input URL in the URLFilter?
> >>>
> >>>Thanks,
> >>>
> >>>Zhique
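The constructor pattern suggested in the thread - read a path from the configuration once, then share one reader across filter instances via a static member - can be sketched as below. SimpleConf and SimpleLinkDbReader are hypothetical stand-ins for Hadoop's Configuration and Nutch's LinkDbReader so the sketch runs standalone, and the "linkdb.path" property key is illustrative, not a real Nutch key:

```java
import java.util.Map;

// Stand-in for org.apache.hadoop.conf.Configuration (illustrative).
class SimpleConf {
    private final Map<String, String> props;
    SimpleConf(Map<String, String> props) { this.props = props; }
    String get(String key) { return props.get(key); }
}

// Stand-in for org.apache.nutch.crawl.LinkDbReader (illustrative).
class SimpleLinkDbReader {
    private final String linkDbPath;
    SimpleLinkDbReader(String linkDbPath) { this.linkDbPath = linkDbPath; }
    String getPath() { return linkDbPath; }
}

public class DedupURLFilter {
    // Shared across all filter instances so the (expensive) reader
    // is opened only once, even though each fetch cycle constructs
    // a new filter instance.
    private static volatile SimpleLinkDbReader reader;

    public DedupURLFilter(SimpleConf conf) {
        if (reader == null) {
            synchronized (DedupURLFilter.class) {
                if (reader == null) {
                    reader = new SimpleLinkDbReader(conf.get("linkdb.path"));
                }
            }
        }
    }

    // Mirrors URLFilter.filter(String): a real implementation would
    // consult the shared reader here and return null for duplicates.
    public String filter(String url) {
        return url;
    }

    static SimpleLinkDbReader currentReader() { return reader; }

    public static void main(String[] args) {
        SimpleConf conf = new SimpleConf(Map.of("linkdb.path", "/data/crawl/linkdb"));
        DedupURLFilter first = new DedupURLFilter(conf);
        DedupURLFilter second = new DedupURLFilter(conf);
        // Both instances see the same underlying reader:
        System.out.println(DedupURLFilter.currentReader().getPath());
    }
}
```

The double-checked locking around the static field is one way to get the "maybe statically initialized for efficiency" behavior while still deferring initialization until a configuration is available.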

