Cool, good test. I thought the Nutch plugin system cached plugin instances - I am not sure whether it creates a new one each time. Are you sure you don't have the same URLFilter instance that is simply being called on different datasets and thus producing different counts?
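For illustration, a minimal sketch of the kind of per-instance counter that can answer that question (the sort of check reported in the quoted logs below), assuming the Nutch 1.x URLFilter interface; the class name, log message, and use of slf4j are illustrative guesses, not the actual ExactDupURLFilter code:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/** Illustrative filter that counts how many URLs one instance has seen. */
public class CountingURLFilter implements URLFilter {

  private static final Logger LOG =
      LoggerFactory.getLogger(CountingURLFilter.class);

  private Configuration conf;

  // Per-instance counter: if Nutch reuses one instance the count keeps
  // growing across jobs; if a new instance is built each time, it restarts.
  private int processed = 0;

  @Override
  public String filter(String urlString) {
    processed++;
    LOG.info("Processed {} links", processed);
    return urlString; // accept everything; this filter only observes
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}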
Either way, you should simply proceed with the filters in whatever form they are working (cached or not).

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Renxia Wang <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Sunday, February 22, 2015 at 9:16 PM
To: "[email protected]" <[email protected]>
Subject: Re: How to read metadata/content of an URL in URLFilter?

>I just added a counter in my URLFilter, and it proves that the URLFilter
>instances in each fetching cycle are different.
>
>Sample logs:
>2015-02-22 21:07:10,636 INFO exactdup.ExactDupURLFilter - Processed 69 links
>2015-02-22 21:07:10,638 INFO exactdup.ExactDupURLFilter - Processed 70 links
>2015-02-22 21:07:10,640 INFO exactdup.ExactDupURLFilter - Processed 71 links
>2015-02-22 21:07:10,641 INFO exactdup.ExactDupURLFilter - Processed 72 links
>2015-02-22 21:07:10,643 INFO exactdup.ExactDupURLFilter - Processed 73 links
>2015-02-22 21:07:10,645 INFO exactdup.ExactDupURLFilter - Processed 74 links
>2015-02-22 21:07:10,647 INFO exactdup.ExactDupURLFilter - Processed 75 links
>2015-02-22 21:07:10,649 INFO exactdup.ExactDupURLFilter - Processed 76 links
>2015-02-22 21:07:10,650 INFO exactdup.ExactDupURLFilter - Processed 77 links
>2015-02-22 21:07:13,835 INFO exactdup.ExactDupURLFilter - Processed 1 links
>2015-02-22 21:07:13,850 INFO exactdup.ExactDupURLFilter - Processed 2 links
>2015-02-22 21:07:13,865 INFO exactdup.ExactDupURLFilter - Processed 3 links
>2015-02-22 21:07:13,878 INFO exactdup.ExactDupURLFilter - Processed 4 links
>2015-02-22 21:07:13,889 INFO exactdup.ExactDupURLFilter - Processed 5 links
>2015-02-22 21:07:13,899 INFO exactdup.ExactDupURLFilter - Processed 6 links
>
>Not sure if it is configurable?
>
>On Sun, Feb 22, 2015 at 8:56 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>That's one way - for sure - but what I was implying is that
>you can train (read: feed data into) your model (read: algorithm)
>using previously crawled information. So, no, I wasn't implying
>machine learning.
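A rough sketch of that "train from previously crawled data" idea, without any machine learning: assume the URLs seen in an earlier crawl have been exported to a one-URL-per-line text file (for example from a bin/nutch readdb dump, post-processed as needed); the class name and the property urlfilter.seen.dump are placeholders, not standard Nutch settings:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.net.URLFilter;

/** Illustrative filter seeded ("trained") with URLs from an earlier crawl. */
public class PreviouslySeenURLFilter implements URLFilter {

  private Configuration conf;
  private final Set<String> seen = new HashSet<String>();

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // Placeholder property: a one-URL-per-line file produced from a prior crawl.
    String dump = conf.get("urlfilter.seen.dump");
    if (dump == null) {
      return;
    }
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        FileSystem.get(conf).open(new Path(dump)), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        line = line.trim();
        if (!line.isEmpty()) {
          seen.add(line);
        }
      }
    } catch (IOException e) {
      throw new RuntimeException("Could not load seen URLs from " + dump, e);
    }
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public String filter(String urlString) {
    // Reject exact URLs already crawled before; keep everything else.
    return seen.contains(urlString) ? null : urlString;
  }
}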
>-----Original Message-----
>From: Renxia Wang <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Sunday, February 22, 2015 at 8:47 PM
>To: "[email protected]" <[email protected]>
>Subject: Re: How to read metadata/content of an URL in URLFilter?
>
>>Hi Prof Mattmann,
>>
>>You are saying "train" and "model" - are we expected to use machine
>>learning algorithms to train a model for duplicate detection?
>>
>>Thanks,
>>
>>Renxia
>>
>>On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
>><[email protected]> wrote:
>>
>>There is nothing in your assignment stating that you can't
>>use *previously* crawled data to train your model - you
>>should have at least 2 full sets of this.
>>
>>Cheers,
>>Chris
>>
>>-----Original Message-----
>>From: Majisha Parambath <[email protected]>
>>Reply-To: "[email protected]" <[email protected]>
>>Date: Sunday, February 22, 2015 at 8:30 PM
>>To: dev <[email protected]>
>>Subject: Re: How to read metadata/content of an URL in URLFilter?
>>
>>>My understanding is that the LinkDB or CrawlDB will contain the results
>>>of previously fetched and parsed pages.
>>>
>>>However, if we want to get the contents of a URL/page in the URL
>>>filtering stage (i.e., for a page that has not yet been fetched), is
>>>there any util in Nutch that we can use to fetch the contents of the
>>>page?
>>>
>>>Thanks and regards,
>>>Majisha Namath Parambath
>>>Graduate Student, M.S. in Computer Science
>>>Viterbi School of Engineering
>>>University of Southern California, Los Angeles
>>>
>>>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
>>><[email protected]> wrote:
>>>
>>>In the constructor of your URLFilter, why not consider passing
>>>in a NutchConfiguration object, and then reading the path to, e.g.,
>>>the LinkDb from the config. Then have a private member variable
>>>for the LinkDbReader (maybe statically initialized for efficiency)
>>>and use that in your interface method.
>>>
>>>Cheers,
>>>Chris
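A rough sketch of that suggestion, assuming the Nutch 1.x LinkDbReader API (a constructor taking a Configuration and a Path, plus getInlinks(Text)); verify those signatures against your Nutch version. The property urlfilter.exactdup.linkdb and the drop-if-already-known policy are placeholders to adapt:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.net.URLFilter;

/** Illustrative filter that consults the LinkDb for each candidate URL. */
public class LinkDbAwareURLFilter implements URLFilter {

  // Shared across calls for efficiency, as suggested above; a production
  // version would want proper synchronization around the lazy init.
  private static LinkDbReader linkDbReader;

  private Configuration conf;

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // Placeholder property: point it at the crawl's linkdb directory.
    String linkDbPath = conf.get("urlfilter.exactdup.linkdb");
    if (linkDbReader == null && linkDbPath != null) {
      try {
        linkDbReader = new LinkDbReader(conf, new Path(linkDbPath));
      } catch (Exception e) {
        throw new RuntimeException("Could not open LinkDb at " + linkDbPath, e);
      }
    }
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public String filter(String urlString) {
    if (linkDbReader == null) {
      return urlString; // no LinkDb configured; accept the URL
    }
    try {
      Inlinks inlinks = linkDbReader.getInlinks(new Text(urlString));
      // Example policy only: drop URLs the LinkDb already knows about.
      if (inlinks != null && inlinks.size() > 0) {
        return null;
      }
      return urlString;
    } catch (IOException e) {
      return urlString; // on read errors, fail open and keep the URL
    }
  }
}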
>>>-----Original Message-----
>>>From: Renxia Wang <[email protected]>
>>>Reply-To: "[email protected]" <[email protected]>
>>>Date: Sunday, February 22, 2015 at 3:36 PM
>>>To: "[email protected]" <[email protected]>
>>>Subject: How to read metadata/content of an URL in URLFilter?
>>>
>>>>Hi,
>>>>
>>>>I want to develop a URLFilter that takes a URL, looks at its metadata
>>>>or even its fetched content, and then uses duplicate detection
>>>>algorithms to determine whether it is a duplicate of any URL in the
>>>>batch. However, the only parameter passed into the URLFilter is the
>>>>URL. Is it possible to get the data I want for that input URL inside
>>>>the URLFilter?
>>>>
>>>>Thanks,
>>>>
>>>>Zhique
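For context, the extension point itself only hands the filter the URL string, which is why any metadata or content has to come from somewhere else (for example the Configuration and LinkDb, as suggested earlier in the thread). Roughly, as paraphrased from the Nutch 1.x sources:

import org.apache.hadoop.conf.Configurable;
import org.apache.nutch.plugin.Pluggable;

// Shape of org.apache.nutch.net.URLFilter in Nutch 1.x, paraphrased:
// the only per-URL input is the URL string itself.
public interface URLFilter extends Pluggable, Configurable {
  /** Return the (possibly rewritten) URL to keep it, or null to reject it. */
  String filter(String urlString);
}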

