Cool, good test. I thought the Nutch plugin system cached plugin instances - I am not sure whether it creates a new one each time. Are you sure you don't have the same URLFilter instance that is simply being called on different datasets and thus producing different counts?
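For illustration, a minimal sketch of the kind of per-instance counter that can answer that question (the sort of check reported in the quoted logs below), assuming the Nutch 1.x URLFilter interface; the class name, log message, and use of slf4j are illustrative guesses, not the actual ExactDupURLFilter code:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative filter that counts how many URLs one instance has seen. */
public class CountingURLFilter implements URLFilter {

  private static final Logger LOG =
      LoggerFactory.getLogger(CountingURLFilter.class);

  private Configuration conf;

  // Per-instance counter: if Nutch reuses one instance the count keeps
  // growing across jobs; if a new instance is built each time, it restarts.
  private int processed = 0;

  @Override
  public String filter(String urlString) {
    processed++;
    LOG.info("Processed {} links", processed);
    return urlString; // accept everything; this filter only observes
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}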
Either way, you should simply proceed with the filters in whatever form they are working (cached or not).

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Renxia Wang <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, February 22, 2015 at 9:16 PM
To: "[email protected]" <[email protected]>
Subject: Re: How to read metadata/content of an URL in URLFilter?

>I just added a counter in my URLFilter, and it proves that the URLFilter
>instances in each fetching cycle are different.
>
>Sample logs:
>2015-02-22 21:07:10,636 INFO exactdup.ExactDupURLFilter - Processed 69 links
>2015-02-22 21:07:10,638 INFO exactdup.ExactDupURLFilter - Processed 70 links
>2015-02-22 21:07:10,640 INFO exactdup.ExactDupURLFilter - Processed 71 links
>2015-02-22 21:07:10,641 INFO exactdup.ExactDupURLFilter - Processed 72 links
>2015-02-22 21:07:10,643 INFO exactdup.ExactDupURLFilter - Processed 73 links
>2015-02-22 21:07:10,645 INFO exactdup.ExactDupURLFilter - Processed 74 links
>2015-02-22 21:07:10,647 INFO exactdup.ExactDupURLFilter - Processed 75 links
>2015-02-22 21:07:10,649 INFO exactdup.ExactDupURLFilter - Processed 76 links
>2015-02-22 21:07:10,650 INFO exactdup.ExactDupURLFilter - Processed 77 links
>2015-02-22 21:07:13,835 INFO exactdup.ExactDupURLFilter - Processed 1 links
>2015-02-22 21:07:13,850 INFO exactdup.ExactDupURLFilter - Processed 2 links
>2015-02-22 21:07:13,865 INFO exactdup.ExactDupURLFilter - Processed 3 links
>2015-02-22 21:07:13,878 INFO exactdup.ExactDupURLFilter - Processed 4 links
>2015-02-22 21:07:13,889 INFO exactdup.ExactDupURLFilter - Processed 5 links
>2015-02-22 21:07:13,899 INFO exactdup.ExactDupURLFilter - Processed 6 links
>
>Not sure if it is configurable?
>
>On Sun, Feb 22, 2015 at 8:56 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>That's one way - for sure - but what I was implying is that
>you can train (read: feed data into) your model (read: algorithm)
>using previously crawled information. So, no, I wasn't implying
>machine learning.
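A rough sketch of that "train from previously crawled data" idea, without any machine learning: assume the URLs seen in an earlier crawl have been exported to a one-URL-per-line text file (for example from a bin/nutch readdb dump, post-processed as needed); the class name and the property urlfilter.seen.dump are placeholders, not standard Nutch settings:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.net.URLFilter;

/** Illustrative filter seeded ("trained") with URLs from an earlier crawl. */
public class PreviouslySeenURLFilter implements URLFilter {

  private Configuration conf;
  private final Set<String> seen = new HashSet<String>();

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // Placeholder property: a one-URL-per-line file produced from a prior crawl.
    String dump = conf.get("urlfilter.seen.dump");
    if (dump == null) {
      return;
    }
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        FileSystem.get(conf).open(new Path(dump)), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (!line.isEmpty()) {
          seen.add(line);
        }
      }
    } catch (IOException e) {
      throw new RuntimeException("Could not load seen URLs from " + dump, e);
    }
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public String filter(String urlString) {
    // Reject exact URLs already crawled before; keep everything else.
    return seen.contains(urlString) ? null : urlString;
  }
}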
>-----Original Message-----
>From: Renxia Wang <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Sunday, February 22, 2015 at 8:47 PM
>To: "[email protected]" <[email protected]>
>Subject: Re: How to read metadata/content of an URL in URLFilter?
>
>>Hi Prof Mattmann,
>>
>>You are saying "train" and "model" - are we expected to use machine
>>learning algorithms to train a model for duplicate detection?
>>
>>Thanks,
>>
>>Renxia
>>
>>On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
>><[email protected]> wrote:
>>
>>There is nothing in your assignment stating that you can't
>>use *previously* crawled data to train your model - you
>>should have at least 2 full sets of this.
>>
>>Cheers,
>>Chris
>>
>>-----Original Message-----
>>From: Majisha Parambath <[email protected]>
>>Reply-To: "[email protected]" <[email protected]>
>>Date: Sunday, February 22, 2015 at 8:30 PM
>>To: dev <[email protected]>
>>Subject: Re: How to read metadata/content of an URL in URLFilter?
>>
>>>My understanding is that the LinkDB or CrawlDB will contain the results
>>>of previously fetched and parsed pages.
>>>
>>>However, if we want to get the contents of a URL/page in the URL
>>>filtering stage (i.e., for a page that has not yet been fetched), is
>>>there any util in Nutch that we can use to fetch the contents of the
>>>page?
>>>
>>>Thanks and regards,
>>>Majisha Namath Parambath
>>>Graduate Student, M.S. in Computer Science
>>>Viterbi School of Engineering
>>>University of Southern California, Los Angeles
>>>
>>>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
>>><[email protected]> wrote:
>>>
>>>In the constructor of your URLFilter, why not consider passing
>>>in a NutchConfiguration object, and then reading the path to, e.g.,
>>>the LinkDb from the config. Then have a private member variable
>>>for the LinkDbReader (maybe statically initialized for efficiency)
>>>and use that in your interface method.
>>>
>>>Cheers,
>>>Chris
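A rough sketch of that suggestion, assuming the Nutch 1.x LinkDbReader API (a constructor taking a Configuration and a Path, plus getInlinks(Text)); verify those signatures against your Nutch version. The property urlfilter.exactdup.linkdb and the drop-if-already-known policy are placeholders to adapt:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.net.URLFilter;

/** Illustrative filter that consults the LinkDb for each candidate URL. */
public class LinkDbAwareURLFilter implements URLFilter {

  // Shared across calls for efficiency, as suggested above; a production
  // version would want proper synchronization around the lazy init.
  private static LinkDbReader linkDbReader;

  private Configuration conf;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // Placeholder property: point it at the crawl's linkdb directory.
    String linkDbPath = conf.get("urlfilter.exactdup.linkdb");
    if (linkDbReader == null && linkDbPath != null) {
      try {
        linkDbReader = new LinkDbReader(conf, new Path(linkDbPath));
      } catch (Exception e) {
        throw new RuntimeException("Could not open LinkDb at " + linkDbPath, e);
      }
    }
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public String filter(String urlString) {
    if (linkDbReader == null) {
      return urlString; // no LinkDb configured; accept the URL
    }
    try {
      Inlinks inlinks = linkDbReader.getInlinks(new Text(urlString));
      // Example policy only: drop URLs the LinkDb already knows about.
      if (inlinks != null && inlinks.size() > 0) {
        return null;
      }
      return urlString;
    } catch (IOException e) {
      return urlString; // on read errors, fail open and keep the URL
    }
  }
}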
>>>-----Original Message-----
>>>From: Renxia Wang <[email protected]>
>>>Reply-To: "[email protected]" <[email protected]>
>>>Date: Sunday, February 22, 2015 at 3:36 PM
>>>To: "[email protected]" <[email protected]>
>>>Subject: How to read metadata/content of an URL in URLFilter?
>>>
>>>>Hi,
>>>>
>>>>I want to develop a URLFilter that takes a URL, looks at its metadata
>>>>or even its fetched content, and then uses duplicate detection
>>>>algorithms to determine whether it is a duplicate of any URL in the
>>>>batch. However, the only parameter passed into the URLFilter is the
>>>>URL. Is it possible to get the data I want for that input URL inside
>>>>the URLFilter?
>>>>
>>>>Thanks,
>>>>
>>>>Zhique
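For context, the extension point itself only hands the filter the URL string, which is why any metadata or content has to come from somewhere else (for example the Configuration and LinkDb, as suggested earlier in the thread). Roughly, as paraphrased from the Nutch 1.x sources:

import org.apache.hadoop.conf.Configurable;
import org.apache.nutch.plugin.Pluggable;

// Shape of org.apache.nutch.net.URLFilter in Nutch 1.x, paraphrased:
// the only per-URL input is the URL string itself.
public interface URLFilter extends Pluggable, Configurable {
  /** Return the (possibly rewritten) URL to keep it, or null to reject it. */
  String filter(String urlString);
}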

