My understanding is that the LinkDb or CrawlDb will contain the results of previously fetched and parsed pages. However, if we want to get the contents of a URL/page in the URL filtering stage (*which is not yet fetched*), is there any utility in Nutch that we can use to fetch the contents of the page?
Thanks and regards,
*Majisha Namath Parambath*
*Graduate Student, M.S in Computer Science*
*Viterbi School of Engineering*
*University of Southern California, Los Angeles*

On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980) <[email protected]> wrote:

> In the constructor of your URLFilter, why not consider passing
> in a NutchConfiguration object, and then reading the path to, e.g.,
> the LinkDb from the config. Then have a private member variable
> for the LinkDbReader (maybe static initialized for efficiency)
> and use that in your interface method.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Renxia Wang <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Sunday, February 22, 2015 at 3:36 PM
> To: "[email protected]" <[email protected]>
> Subject: How to read metadata/content of an URL in URLFilter?
>
> >Hi
> >
> >I want to develop a URLFilter which takes a URL, takes its metadata or
> >even the fetched content, then uses some duplicate detection algorithm to
> >determine if it is a duplicate of any URL in the batch. However, the only
> >parameter passed into the URLFilter is the URL. Is it possible to get the
> >data I want for that input URL in the URLFilter?
> >
> >Thanks,
> >
> >Zhique
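
For anyone following the thread, here is a rough sketch of the shape Chris's suggestion might take: read a path from the configuration once in the constructor, keep the reader as a private member, and consult it inside the filter method. To keep it self-contained it does not use the Nutch classes. `LinkLookup` and `SimpleUrlFilter` are hypothetical stand-ins for a LinkDb reader and the URLFilter interface, the `example.linkdb.path` key is invented, and the "reject if already known" rule is only a placeholder for a real duplicate check. The one Nutch convention it borrows is that a filter returns the URL to accept it and null to reject it.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical stand-in for a reader over stored link data (not the Nutch API). */
interface LinkLookup {
    /** Returns the stored inlinks for a URL, or an empty list if the URL is unknown. */
    List<String> inlinksOf(String url);
}

/** Hypothetical stand-in for a URL filter: return the URL to keep it, null to drop it. */
interface SimpleUrlFilter {
    String filter(String url);
}

final class LinkAwareUrlFilter implements SimpleUrlFilter {
    /** Invented configuration key, used only for this illustration. */
    static final String LINKDB_PATH_KEY = "example.linkdb.path";

    /** Opened once in the constructor and reused for every filter() call. */
    private final LinkLookup lookup;

    /**
     * @param conf   configuration holding the path to the stored link data
     * @param opener turns that path into a reader; a real plugin would open
     *               the on-disk database here instead
     */
    LinkAwareUrlFilter(Map<String, String> conf, Function<String, LinkLookup> opener) {
        String path = conf.get(LINKDB_PATH_KEY);
        if (path == null) {
            throw new IllegalArgumentException("Missing config key: " + LINKDB_PATH_KEY);
        }
        this.lookup = opener.apply(path);
    }

    @Override
    public String filter(String url) {
        List<String> inlinks = lookup.inlinksOf(url);
        // Placeholder rule: reject URLs the stored data already knows about.
        boolean alreadyKnown = inlinks != null && !inlinks.isEmpty();
        return alreadyKnown ? null : url;
    }
}
```

Note that a lookup like this can only answer for URLs that earlier crawl rounds have already recorded. For a URL that has never been seen it returns nothing, so the filter in this sketch simply lets it through.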

