Re: How to read metadata/content of an URL in URLFilter?

Majisha Parambath Sun, 22 Feb 2015 20:33:04 -0800

My understanding is that the LinkDB or CrawlDB will contain the results of
previously fetched and parsed pages.
However, if we want to get the contents of a URL/page in the URL filtering
stage (*which is not yet fetched*), is there any utility in Nutch that we
can use to fetch the contents of the page?


Thanks and regards,
*Majisha Namath Parambath*
*Graduate Student, M.S in Computer Science*
*Viterbi School of Engineering*
*University of Southern California, Los Angeles*

On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980) <
[email protected]> wrote:

> In the constructor of your URLFilter, why not consider passing
> in a NutchConfiguration object, and then reading the path to, e.g.,
> the LinkDb from the config? Then have a private member variable
> for the LinkDbReader (maybe static initialized for efficiency)
> and use that in your interface method.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Renxia Wang <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Sunday, February 22, 2015 at 3:36 PM
> To: "[email protected]" <[email protected]>
> Subject: How to read metadata/content of an URL in URLFilter?
>
> >
> >Hi
> >
> >
> >I want to develop a URLFilter which takes a URL, takes its metadata or
> >even the fetched content, then uses some duplicate detection algorithm to
> >determine if it is a duplicate of any URL in the batch. However, the only
> >parameter passed into the URLFilter is the URL. Is it possible to get the
> >data I want for that input URL in the URLFilter?
> >
> >
> >Thanks,
> >
> >
> >Zhique
>
>
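
A minimal sketch of the pattern Chris describes above: read a storage path
from the configuration in the constructor, keep the reader as a private
member, and consult it inside the filter method. To stay runnable without
Nutch on the classpath, everything here is a stand-in and an assumption:
java.util.Properties plays the role of the Nutch configuration,
StoredRecordReader is a hypothetical substitute for a reader such as
LinkDbReader, and the key "sketch.linkdb.path" and the signature-based
duplicate policy are made up for illustration. The one thing taken from
Nutch is the filter contract: return the URL to keep it, or null to drop it.
As noted in the reply at the top of the thread, this can only see data for
pages that were already fetched and parsed.

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical stand-in for a DB reader: returns the stored content
// signature for a URL, or null if the URL has no stored record yet.
interface StoredRecordReader {
    String signatureOf(String url);
}

class DbBackedUrlFilter {
    // Hypothetical configuration key holding the path to the stored data.
    static final String DB_PATH_KEY = "sketch.linkdb.path";

    // Private member holding the reader; it could be made static to share
    // one reader across filter instances, as the reply suggests.
    private final StoredRecordReader reader;

    // Remembers the first URL seen for each signature.
    private final Map<String, String> firstUrlBySignature = new ConcurrentHashMap<>();

    DbBackedUrlFilter(Properties conf, Function<String, StoredRecordReader> opener) {
        String path = conf.getProperty(DB_PATH_KEY);
        if (path == null) {
            throw new IllegalStateException(DB_PATH_KEY + " is not set");
        }
        this.reader = opener.apply(path);
    }

    // Same contract as a Nutch URL filter: the URL to keep it, null to drop it.
    String filter(String url) {
        String signature = reader.signatureOf(url);
        if (signature == null) {
            return url; // nothing stored for this URL, so it cannot be judged
        }
        String first = firstUrlBySignature.putIfAbsent(signature, url);
        return (first == null || first.equals(url)) ? url : null;
    }
}

class UrlFilterSketch {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(DbBackedUrlFilter.DB_PATH_KEY, "crawl/linkdb");

        // In-memory data standing in for what a real reader would load from the path.
        Map<String, String> stored = Map.of(
                "http://a.example/1", "sig-1",
                "http://a.example/1?copy", "sig-1");

        DbBackedUrlFilter filter = new DbBackedUrlFilter(conf, path -> stored::get);
        System.out.println(filter.filter("http://a.example/1"));      // kept
        System.out.println(filter.filter("http://a.example/1?copy")); // null (same signature)
        System.out.println(filter.filter("http://b.example/new"));    // kept (no record)
    }
}
```

In an actual plugin the class would implement org.apache.nutch.net.URLFilter
and the opener would construct the real reader from the configured path; the
sketch only shows where the configuration lookup and the reader fit.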
