Re: How to read metadata/content of an URL in URLFilter?

Mattmann, Chris A (3980) Sun, 22 Feb 2015 20:42:18 -0800

Nothing in your assignment says that you can’t
use *previously* crawled data to train your model - you
should have at least 2 full sets of this.


Cheers,
Chris


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: Majisha Parambath <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, February 22, 2015 at 8:30 PM
To: dev <[email protected]>
Subject: Re: How to read metadata/content of an URL in URLFilter?

>
>
>
>My understanding is that the LinkDb or CrawlDb will contain the results
>of previously fetched and parsed pages.
>
>However, if we want to get the contents of a URL/page in the URL filtering
>stage (when it has not yet been fetched), is there any utility in Nutch that
>we can use to fetch the contents of the page?
>
>
>Thanks and regards,
>Majisha Namath Parambath
>Graduate Student, M.S in Computer Science
>Viterbi School of Engineering
>University of Southern California, Los Angeles
>
>
>
>
>
>
>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>In the constructor of your URLFilter, why not consider passing
>in a NutchConfiguration object, and then reading the path to e.g.,
>the LinkDb from the config. Then have a private member variable
>for the LinkDbReader (maybe static initialized for efficiency)
>and use that in your interface method.
>
>Cheers,
>Chris
>
>
>
>
>
>
>
>-----Original Message-----
>From: Renxia Wang <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Sunday, February 22, 2015 at 3:36 PM
>To: "[email protected]" <[email protected]>
>Subject: How to read metadata/content of an URL in URLFilter?
>
>>
>>
>>
>>Hi
>>
>>
>>I want to develop a URLFilter which takes a URL, takes its metadata or
>>even the fetched content, then uses some duplicate detection algorithm to
>>determine if it is a duplicate of any URL in Nutch. However, the only
>>parameter passed into the URLFilter is the URL. Is it possible to get the
>>data I want for that input URL in the URLFilter?
>>
>>
>>Thanks,
>>
>>
>>Zhique
>
>
>
>
>
>
>
>
>
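
The suggestion quoted above (read a path from the configuration, keep a
statically initialized reader, and consult it in the filter method) might
be sketched roughly as follows. This is only an illustration under stated
assumptions, not Nutch code: CrawlDataReader, the "sketch.linkdb.path" key
and the class name are hypothetical stand-ins so that the example compiles
on its own. In Nutch itself the filter would implement
org.apache.nutch.net.URLFilter, whose filter(String) returns null to reject
a URL, and the reader would be something like a LinkDbReader. Treating "has
an entry" as "duplicate" is a placeholder for a real duplicate detection
algorithm.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Function;

final class SeenUrlFilterSketch {

    /** Hypothetical stand-in for a reader over previously crawled data. */
    interface CrawlDataReader {
        /** Returns stored data for the URL, or null if nothing is stored. */
        String lookup(String url);
    }

    /** Hypothetical configuration key; not a real Nutch property. */
    static final String DB_PATH_KEY = "sketch.linkdb.path";

    // Static so that all filter instances share one reader, opened once.
    private static CrawlDataReader reader;

    private final Properties conf;
    private final Function<String, CrawlDataReader> opener;

    SeenUrlFilterSketch(Properties conf, Function<String, CrawlDataReader> opener) {
        this.conf = conf;
        this.opener = opener;
    }

    /** Opens the shared reader on first use, using the path from the config. */
    private static synchronized CrawlDataReader sharedReader(
            Properties conf, Function<String, CrawlDataReader> opener) {
        if (reader == null) {
            String path = conf.getProperty(DB_PATH_KEY);
            if (path == null) {
                throw new IllegalStateException("missing config key " + DB_PATH_KEY);
            }
            reader = opener.apply(path);
        }
        return reader;
    }

    /** Returns the URL to keep it, or null to reject it. */
    public String filter(String url) {
        // Placeholder rule: any URL with stored data counts as a duplicate.
        boolean seenBefore = sharedReader(conf, opener).lookup(url) != null;
        return seenBefore ? null : url;
    }

    public static void main(String[] args) {
        // An in-memory map plays the role of the previously crawled data.
        Map<String, String> previouslyCrawled = new HashMap<>();
        previouslyCrawled.put("http://example.com/a", "fetched");

        Properties conf = new Properties();
        conf.setProperty(DB_PATH_KEY, "crawl/linkdb");

        SeenUrlFilterSketch f =
            new SeenUrlFilterSketch(conf, path -> previouslyCrawled::get);
        System.out.println(f.filter("http://example.com/a"));
        System.out.println(f.filter("http://example.com/b"));
    }
}
```

Note that this only consults data from earlier crawl rounds; it does not
fetch the page being filtered, which matches the point that the content of
a not-yet-fetched URL is not available at the filtering stage.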
