That’s one way, for sure, but what I was implying is that you can train (read: feed data into) your model (read: algorithm) using previously crawled information. So no, I wasn’t implying machine learning.
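To make the non-ML reading concrete, here is a minimal sketch (plain Java, no Nutch dependencies; the class and method names are made up for illustration). "Training" here just means populating a checksum index from a previous crawl's page content, so a later crawl can check new pages against it:

```java
import java.security.MessageDigest;
import java.nio.charset.StandardCharsets;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative only: "training" = loading checksums of previously
 * crawled page content into a lookup set. No machine learning involved.
 */
class CrawlDedupIndex {
    private final Set<String> seenChecksums = new HashSet<>();

    /** MD5 hex digest of page content (exact-duplicate signature). */
    static String checksum(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (java.security.NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** "Train": feed page contents from a previous crawl into the index. */
    void addPreviousCrawl(Collection<String> pageContents) {
        for (String content : pageContents) {
            seenChecksums.add(checksum(content));
        }
    }

    /** True if this content exactly matches a previously crawled page. */
    boolean isDuplicate(String content) {
        return seenChecksums.contains(checksum(content));
    }
}
```

With two full crawl sets, as the assignment assumes, you would feed the first set in via addPreviousCrawl and test pages from the second set with isDuplicate. Exact-match checksums are the simplest case; near-duplicate detection would swap in a shingling or simhash signature instead.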
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Renxia Wang <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, February 22, 2015 at 8:47 PM
To: "[email protected]" <[email protected]>
Subject: Re: How to read metadata/content of an URL in URLFilter?

>Hi Prof Mattmann,
>
>You said "train" and "model". Are we expected to use machine learning
>algorithms to train a model for duplicate detection?
>
>Thanks,
>
>Renxia
>
>On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>There is nothing in your assignment stating that you can't
>use *previously* crawled data to train your model; you
>should have at least 2 full sets of this.
>
>Cheers,
>Chris
>
>-----Original Message-----
>From: Majisha Parambath <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Sunday, February 22, 2015 at 8:30 PM
>To: dev <[email protected]>
>Subject: Re: How to read metadata/content of an URL in URLFilter?
>
>>My understanding is that the LinkDb or CrawlDb will contain the results
>>of previously fetched and parsed pages.
>>
>>However, if we want to get the contents of a URL/page in the URL
>>filtering stage (where it has not yet been fetched), is there any util
>>in Nutch that we can use to fetch the contents of the page?
>>
>>Thanks and regards,
>>Majisha Namath Parambath
>>Graduate Student, M.S. in Computer Science
>>Viterbi School of Engineering
>>University of Southern California, Los Angeles
>>
>>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
>><[email protected]> wrote:
>>
>>In the constructor of your URLFilter, why not consider passing
>>in a NutchConfiguration object, and then reading the path to, e.g.,
>>the LinkDb from the config. Then have a private member variable
>>for the LinkDbReader (maybe statically initialized for efficiency)
>>and use that in your interface method.
>>
>>Cheers,
>>Chris
>>
>>-----Original Message-----
>>From: Renxia Wang <[email protected]>
>>Reply-To: "[email protected]" <[email protected]>
>>Date: Sunday, February 22, 2015 at 3:36 PM
>>To: "[email protected]" <[email protected]>
>>Subject: How to read metadata/content of an URL in URLFilter?
>>
>>>Hi,
>>>
>>>I want to develop a URLFilter that takes a URL, gets its metadata or
>>>even its fetched content, and then uses some duplicate detection
>>>algorithms to determine whether it is a duplicate of any URL in the
>>>batch. However, the only parameter passed into the URLFilter is the
>>>URL. Is it possible to get the data I want for that input URL inside
>>>the URLFilter?
>>>
>>>Thanks,
>>>
>>>Zhique
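For reference, the configuration-driven filter pattern described in the quoted thread above might look roughly like the sketch below. The types here (SimpleURLFilter, SimpleConfig, SimpleLinkDbReader) are simplified stand-ins, not the real Nutch URLFilter / NutchConfiguration / LinkDbReader API, whose signatures vary by version. The point is only the wiring: read a path from the config in the constructor, hold a reader as a private member, and consult it in the filter method.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Simplified stand-in for Nutch's URLFilter interface: return the URL
 *  to keep it, or null to filter it out. Not the real Nutch API. */
interface SimpleURLFilter {
    String filter(String url);
}

/** Stand-in for a configuration object (e.g. carrying the LinkDb path). */
class SimpleConfig {
    private final Map<String, String> props = new HashMap<>();
    void set(String key, String value) { props.put(key, value); }
    String get(String key) { return props.get(key); }
}

/** Stand-in for a reader over previously crawled data. A real
 *  implementation would open the LinkDb at the configured path. */
class SimpleLinkDbReader {
    private final Set<String> knownUrls;
    SimpleLinkDbReader(Set<String> knownUrls) { this.knownUrls = knownUrls; }
    boolean contains(String url) { return knownUrls.contains(url); }
}

/** The wiring from the thread: take the config in the constructor,
 *  hold a reader as a private member, consult it in filter(). */
class DedupUrlFilter implements SimpleURLFilter {
    private final SimpleLinkDbReader reader;

    DedupUrlFilter(SimpleConfig conf, Set<String> previouslyCrawledUrls) {
        // In real Nutch you would read a path from conf (property name
        // would be your own choice) and open a LinkDbReader on it;
        // here we inject the previously crawled URLs directly.
        this.reader = new SimpleLinkDbReader(previouslyCrawledUrls);
    }

    @Override
    public String filter(String url) {
        // Reject (null) anything already seen in the previous crawl.
        return reader.contains(url) ? null : url;
    }
}
```

A static member for the reader, as suggested in the thread, would avoid reopening the LinkDb for every filter instance; the instance member above keeps the sketch simple.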

