Re: How to read metadata/content of an URL in URLFilter?

Renxia Wang Sun, 22 Feb 2015 20:49:04 -0800

Hi Prof Mattmann,

You mention "train" and "model". Are we expected to use machine learning
algorithms to train a model for duplicate detection?


Thanks,

Renxia

On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980) <
[email protected]> wrote:

> Nothing in your assignment states that you can’t
> use *previously* crawled data to train your model - you
> should have at least 2 full sets of this.
>
> Cheers,
> Chris
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Majisha Parambath <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Sunday, February 22, 2015 at 8:30 PM
> To: dev <[email protected]>
> Subject: Re: How to read metadata/content of an URL in URLFilter?
>
> >
> >
> >
> >My understanding is that the LinkDB or CrawlDB will contain the results
> >of previously fetched and parsed pages.
> >
> >However, if we want the contents of a URL/page during the URL filtering
> >stage (when the page has not yet been fetched), is there any utility in
> >Nutch that we can use to fetch the contents of the page?
> >
> >
> >Thanks and regards,
> >Majisha Namath Parambath
> >Graduate Student, M.S in Computer Science
> >Viterbi School of Engineering
> >University of Southern California, Los Angeles
> >
> >
> >
> >
> >
> >
> >On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
> ><[email protected]> wrote:
> >
> >In the constructor of your URLFilter, why not consider passing
> >in a NutchConfiguration object, and then reading the path to, e.g.,
> >the LinkDb from the config? Then have a private member variable
> >for the LinkDbReader (maybe statically initialized for efficiency)
> >and use that in your interface method.
> >
> >Cheers,
> >Chris
> >
> >
> >-----Original Message-----
> >From: Renxia Wang <[email protected]>
> >Reply-To: "[email protected]" <[email protected]>
> >Date: Sunday, February 22, 2015 at 3:36 PM
> >To: "[email protected]" <[email protected]>
> >Subject: How to read metadata/content of an URL in URLFilter?
> >
> >>
> >>
> >>
> >>Hi
> >>
> >>
> >>I want to develop a URLFilter that takes a URL, reads its metadata or
> >>even the fetched content, and then uses a duplicate detection algorithm to
> >>determine whether it is a duplicate of any URL in Nutch. However, the only
> >>parameter passed into the URLFilter is the URL. Is it possible to get the
> >>data I want for that input URL inside the URLFilter?
> >>
> >>
> >>Thanks,
> >>
> >>
> >>Zhique
> >
>
>
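
The suggestion quoted above (read a path from the configuration in the
constructor, keep the reader in a member variable, and consult it in the
filter method) could be sketched roughly as follows. This is only a
minimal, Nutch-free sketch: SketchUrlFilter, CrawlHistory,
HistoryBackedFilter and the key "sketch.linkdb.path" are hypothetical
names invented for illustration, java.util.Properties stands in for the
Hadoop/Nutch configuration object, and CrawlHistory stands in for
whatever reader (for example a LinkDbReader) is actually opened. The one
detail carried over from Nutch is the filter contract: return the URL to
keep it, or null to reject it.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical stand-in for the URLFilter contract:
// return the URL to accept it, or null to reject it.
interface SketchUrlFilter {
    String filter(String url);
}

// Hypothetical stand-in for a reader over previously crawled data.
interface CrawlHistory {
    boolean contains(String url);
}

final class HistoryBackedFilter implements SketchUrlFilter {
    // Hypothetical configuration key; not a real Nutch property.
    static final String PATH_KEY = "sketch.linkdb.path";

    // Readers are cached statically, one per path, so that creating
    // several filter instances does not reopen the same data.
    private static final Map<String, CrawlHistory> READERS =
            new ConcurrentHashMap<>();

    // Private member variable holding the reader used by filter().
    private final CrawlHistory history;

    // 'opener' turns a path into a reader; it is an assumption of this
    // sketch so that the example runs without Nutch or Hadoop.
    HistoryBackedFilter(Properties conf,
                        Function<String, CrawlHistory> opener) {
        String path = conf.getProperty(PATH_KEY);
        if (path == null) {
            throw new IllegalArgumentException(
                    "missing config key: " + PATH_KEY);
        }
        this.history = Objects.requireNonNull(
                READERS.computeIfAbsent(path, opener),
                "opener returned no reader for " + path);
    }

    @Override
    public String filter(String url) {
        if (url == null) {
            return null;
        }
        // Reject URLs already present in the previously crawled data.
        return history.contains(url) ? null : url;
    }
}
```

With a history containing only http://example.com/a, filter() would
return null for that URL and return any other URL unchanged. Note that
this only gives the filter access to data from earlier crawls; it does
not provide the content of a URL that has not been fetched yet.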
