Hi Prof Mattmann,

You mentioned "train" and "model": are we expected to use machine
learning algorithms to train a model for duplicate detection?

Thanks,

Renxia

On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980) <
[email protected]> wrote:

> There is nothing in your assignment stating that you can’t
> use *previously* crawled data to train your model - you
> should have at least 2 full sets of this.
>
> Cheers,
> Chris
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Majisha Parambath <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Sunday, February 22, 2015 at 8:30 PM
> To: dev <[email protected]>
> Subject: Re: How to read metadata/content of an URL in URLFilter?
>
> >My understanding is that the LinkDB or CrawlDB will contain the results
> >of previously fetched and parsed pages.
> >
> >However, if we want to get the contents of a URL/page at the URL-filtering
> >stage (before it has been fetched), is there a utility in Nutch that we can
> >use to fetch the contents of the page?
> >
> >
> >Thanks and regards,
> >Majisha Namath Parambath
> >Graduate Student, M.S in Computer Science
> >Viterbi School of Engineering
> >University of Southern California, Los Angeles
> >
> >On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
> ><[email protected]> wrote:
> >
> >In the constructor of your URLFilter, why not consider passing
> >in a NutchConfiguration object, and then reading the path to, e.g.,
> >the LinkDb from the config? Then have a private member variable
> >for the LinkDbReader (maybe statically initialized for efficiency)
> >and use that in your interface method.
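A minimal standalone sketch of this shape: the reader is initialized once from a path held in the configuration, then reused in the filter method. The `Conf` and `LinkDbReaderStub` classes below are simplified stand-ins for Nutch's Hadoop `Configuration` and `LinkDbReader` (so the example compiles on its own), and `DedupUrlFilter` and the `linkdb.path` key are hypothetical names; only the structure mirrors the suggestion.

```java
import java.util.*;

// Stand-in for a Hadoop/Nutch Configuration: a simple key/value map.
class Conf {
    private final Map<String, String> props = new HashMap<>();
    void set(String k, String v) { props.put(k, v); }
    String get(String k) { return props.get(k); }
}

// Stand-in for Nutch's LinkDbReader: here it just remembers which
// URLs are already known, so the filter can reject them.
class LinkDbReaderStub {
    private final Set<String> known;
    LinkDbReaderStub(Set<String> known) { this.known = known; }
    boolean hasInlinks(String url) { return known.contains(url); }
}

class DedupUrlFilter {
    // Initialized once and shared across instances for efficiency.
    private static LinkDbReaderStub reader;

    DedupUrlFilter(Conf conf) {
        if (reader == null) {
            // Read the LinkDb path from the config, open the reader once.
            reader = openReader(conf.get("linkdb.path"));
        }
    }

    private static LinkDbReaderStub openReader(String path) {
        // Stubbed: a real implementation would open the LinkDb at `path`.
        Set<String> seen = new HashSet<>();
        seen.add("http://example.com/seen");
        return new LinkDbReaderStub(seen);
    }

    // URLFilter contract: return the URL to keep it, null to reject it.
    String filter(String url) {
        return reader.hasInlinks(url) ? null : url;
    }

    public static void main(String[] args) {
        Conf conf = new Conf();
        conf.set("linkdb.path", "crawl/linkdb");
        DedupUrlFilter f = new DedupUrlFilter(conf);
        System.out.println(f.filter("http://example.com/seen")); // already in the LinkDb -> null
        System.out.println(f.filter("http://example.com/new"));  // unseen -> kept
    }
}
```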
> >
> >Cheers,
> >Chris
> >
> >-----Original Message-----
> >From: Renxia Wang <[email protected]>
> >Reply-To: "[email protected]" <[email protected]>
> >Date: Sunday, February 22, 2015 at 3:36 PM
> >To: "[email protected]" <[email protected]>
> >Subject: How to read metadata/content of an URL in URLFilter?
> >
> >>Hi
> >>
> >>
> >>I want to develop a URLFilter that takes a URL, reads its metadata or
> >>even the fetched content, and then uses a duplicate detection algorithm to
> >>determine whether it is a duplicate of any URL in the batch. However, the
> >>only parameter passed into the URLFilter is the URL string; is it possible
> >>to get the data I want for that input URL inside the URLFilter?
> >>
> >>
> >>Thanks,
> >>
> >>
> >>Zhique