That’s one way, for sure, but what I was implying is that
you can train (read: feed data into) your model (read: algorithm)
using previously crawled information. So no, I wasn’t implying
machine learning.
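Concretely, “training” here can be as simple as building an index of content fingerprints from a previous crawl and checking new pages against it. A minimal sketch in Java (the class name, normalization rule, and MD5 choice are mine for illustration, not anything from Nutch):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Sketch: exact-duplicate detection over crawled page content. */
public class DedupIndex {
    private final Set<String> seen = new HashSet<>();

    /** MD5 of whitespace-normalized content, hex-encoded. */
    static String digest(String content) {
        // Normalize whitespace so trivially reformatted copies hash the same.
        String normalized = content.trim().replaceAll("\\s+", " ");
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : d) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);  // MD5 is always available
        }
    }

    /** True if equivalent content was seen before; records it otherwise. */
    public boolean isDuplicate(String content) {
        return !seen.add(digest(content));
    }

    public static void main(String[] args) {
        DedupIndex index = new DedupIndex();
        System.out.println(index.isDuplicate("Hello  world"));  // false: new content
        System.out.println(index.isDuplicate("Hello world"));   // true: same after normalization
    }
}
```

You would populate the index once from a previous crawl’s parsed content, then consult it when filtering new URLs; near-duplicate detection would swap the hash for something like simhash.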

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Renxia Wang <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, February 22, 2015 at 8:47 PM
To: "[email protected]" <[email protected]>
Subject: Re: How to read metadata/content of an URL in URLFilter?

>Hi Prof Mattmann,
>
>
>You said "train" and "model"; are we expected to use machine
>learning algorithms to train a model for duplicate detection?
>
>
>Thanks,
>
>
>Renxia
>
>
>On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>There is nothing in your assignment stating that you can’t
>use *previously* crawled data to train your model; you
>should have at least 2 full sets of this.
>
>Cheers,
>Chris
>
>
>-----Original Message-----
>From: Majisha Parambath <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Sunday, February 22, 2015 at 8:30 PM
>To: dev <[email protected]>
>Subject: Re: How to read metadata/content of an URL in URLFilter?
>
>>
>>My understanding is that the LinkDB or CrawlDB will contain the results
>>of previously fetched and parsed pages.
>>
>>However, if we want to get the contents of a URL/page at the URL-filtering
>>stage (when it has not yet been fetched), is there any util in Nutch that
>>we can use to fetch the contents of the page?
>>
>>
>>Thanks and regards,
>>Majisha Namath Parambath
>>Graduate Student, M.S in Computer Science
>>Viterbi School of Engineering
>>University of Southern California, Los Angeles
>>
>>
>>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
>><[email protected]> wrote:
>>
>>In the constructor of your URLFilter, why not consider passing
>>in a NutchConfiguration object, and then reading the path to, e.g.,
>>the LinkDb from the config? Then have a private member variable
>>for the LinkDbReader (maybe statically initialized for efficiency)
>>and use that in your interface method.
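A rough sketch of that wiring (this assumes Apache Nutch 1.x’s `URLFilter`, `LinkDbReader`, and `Inlinks` classes on the classpath; the class name and the config key `mycrawl.linkdb.path` are invented for illustration):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.net.URLFilter;

public class LinkDbAwareFilter implements URLFilter {

  private Configuration conf;
  private LinkDbReader linkDbReader;  // lazily created from the configured path

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  private synchronized LinkDbReader reader() throws IOException {
    if (linkDbReader == null) {
      // "mycrawl.linkdb.path" is a made-up key; read your LinkDb path however
      // your plugin is configured.
      Path linkDb = new Path(conf.get("mycrawl.linkdb.path"));
      linkDbReader = new LinkDbReader(conf, linkDb);
    }
    return linkDbReader;
  }

  @Override
  public String filter(String urlString) {
    try {
      Inlinks inlinks = reader().getInlinks(new Text(urlString));
      // Consult inlinks (or other previously crawled metadata) for
      // duplicate detection here...
      return urlString;   // keep the URL; return null to filter it out
    } catch (IOException e) {
      return urlString;   // fail open if the LinkDb cannot be read
    }
  }
}
```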
>>
>>Cheers,
>>Chris
>>
>>
>>-----Original Message-----
>>From: Renxia Wang <[email protected]>
>>Reply-To: "[email protected]" <[email protected]>
>>Date: Sunday, February 22, 2015 at 3:36 PM
>>To: "[email protected]" <[email protected]>
>>Subject: How to read metadata/content of an URL in URLFilter?
>>
>>>
>>>Hi
>>>
>>>
>>>I want to develop a URLFilter that takes a URL, gets its metadata or
>>>even its fetched content, and then uses some duplicate-detection
>>>algorithm to determine whether it is a duplicate of any URL in the
>>>batch. However, the only parameter passed into the URLFilter is the
>>>URL; is it possible to get the data I want for that input URL in the
>>>URLFilter?
>>>
>>>
>>>Thanks,
>>>
>>>
>>>Zhique
