Re: How to read metadata/content of an URL in URLFilter?

Renxia Wang Sun, 22 Feb 2015 20:46:25 -0800

Hi Majisha,

From the comments on the URLFilter interface in the source code, the URL
filter is called in the injector and the db updater, which means that you do
have the crawled data of the URL you are processing in the filter.
You may want to take a look at this article, which illustrates the workflow
of Nutch, although it is for Nutch 1.4:
http://www.atlantbh.com/apache-nutch-overview/
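
For what it's worth, here is a minimal sketch of the kind of filter I have in
mind, with a lookup object held as a member in the way Chris suggests below.
It deliberately does not use the Nutch API: UrlFilter only mirrors the
filter(String) contract of Nutch's URLFilter (return the URL to keep it, null
to drop it), CrawledDataLookup is a hypothetical stand-in for whatever would
read the CrawlDb or LinkDb, and the signature strings are made up.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Stand-in for the URLFilter contract: return the URL to keep it, null to drop it.
interface UrlFilter {
    String filter(String url);
}

// Hypothetical lookup over previously crawled data. In a real plugin this
// would be backed by a reader over the CrawlDb or LinkDb (an assumption).
interface CrawledDataLookup {
    Optional<String> signatureOf(String url);
}

final class DuplicateUrlFilter implements UrlFilter {
    private final CrawledDataLookup lookup;
    // Remembers the first URL seen for each content signature.
    private final Map<String, String> firstUrlBySignature = new HashMap<>();

    DuplicateUrlFilter(CrawledDataLookup lookup) {
        this.lookup = lookup;
    }

    @Override
    public String filter(String url) {
        Optional<String> signature = lookup.signatureOf(url);
        if (!signature.isPresent()) {
            return url; // nothing crawled for this URL yet, so nothing to compare
        }
        String first = firstUrlBySignature.putIfAbsent(signature.get(), url);
        // Keep the URL if it is the first one with this signature (or the same
        // URL seen again); drop it if another URL already has the signature.
        return (first == null || first.equals(url)) ? url : null;
    }
}

class DuplicateUrlFilterDemo {
    public static void main(String[] args) {
        Map<String, String> crawled = new HashMap<>();
        crawled.put("http://example.com/a", "sig-1");
        crawled.put("http://example.com/a?ref=x", "sig-1");
        DuplicateUrlFilter filter =
            new DuplicateUrlFilter(u -> Optional.ofNullable(crawled.get(u)));

        System.out.println(filter.filter("http://example.com/a"));       // kept
        System.out.println(filter.filter("http://example.com/a?ref=x")); // null (duplicate)
        System.out.println(filter.filter("http://example.com/new"));     // kept (no data)
    }
}
```

Note that a URL with no crawled data simply passes through, which is exactly
the not-yet-fetched case Majisha raises below.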


Thanks,

Renxia

On Sun, Feb 22, 2015 at 8:30 PM, Majisha Parambath <[email protected]> wrote:

>
>
> My understanding is that the LinkDB or CrawlDB will contain the results of
> previously fetched and parsed pages.
> However, if we want to get the contents of a URL/page in the URL filtering
> stage (*which is not yet fetched*), is there any util in Nutch that we
> can use to fetch the contents of the page?
>
> Thanks and regards,
> *Majisha Namath Parambath*
> *Graduate Student, M.S in Computer Science*
> *Viterbi School of Engineering*
> *University of Southern California, Los Angeles*
>
> On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980) <
> [email protected]> wrote:
>
>> In the constructor of your URLFilter, why not consider passing
>> in a NutchConfiguration object, and then reading the path to, e.g.,
>> the LinkDb from the config? Then have a private member variable
>> for the LinkDbReader (maybe statically initialized for efficiency)
>> and use that in your interface method.
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Renxia Wang <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Sunday, February 22, 2015 at 3:36 PM
>> To: "[email protected]" <[email protected]>
>> Subject: How to read metadata/content of an URL in URLFilter?
>>
>> >
>> >
>> >
>> >Hi
>> >
>> >
>> >I want to develop a URLFilter which takes a URL, takes its metadata or
>> >even the fetched content, then uses some duplicate detection algorithm to
>> >determine if it is a duplicate of any URL in Nutch. However, the only
>> >parameter passed into the URLFilter is the URL. Is it possible to get the
>> >data I want for that input URL in the URLFilter?
>> >
>> >
>> >Thanks,
>> >
>> >
>> >Zhique
>>
>>
>
