Thanks, Jorge, for the helpful information. Since multiple URLFilter instances are created during crawling, is there any way to share data among them, such as a hashmap? That would be useful for my purpose, duplicate detection. Or should I use an external in-memory database?
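To make the question concrete, this is the kind of sharing I have in mind. The class below is only a standalone sketch (the Nutch URLFilter interface and plugin wiring are stripped out, and the class name is made up); it just models the keep-or-null contract of filter() against a JVM-wide set. One caveat I already see: a static field is only shared within one JVM, so in a distributed job each map task would still have its own copy, which is why I am also asking about an external in-memory database.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: every filter instance consults one JVM-wide set of seen URLs.
// Caveat: a static field is shared only within a single JVM, so in a
// distributed Hadoop job each task still has its own copy; cross-task
// sharing would need an external store instead.
public class ExactDupFilterSketch {
    private static final Set<String> SEEN = ConcurrentHashMap.newKeySet();

    // Mimics URLFilter.filter(String): return the url to keep it,
    // or null to drop it as a duplicate.
    public String filter(String url) {
        return SEEN.add(url) ? url : null; // add() is false if already present
    }

    public static void main(String[] args) {
        ExactDupFilterSketch a = new ExactDupFilterSketch();
        ExactDupFilterSketch b = new ExactDupFilterSketch(); // a second instance
        System.out.println(a.filter("http://example.com/")); // kept: first sighting
        System.out.println(b.filter("http://example.com/")); // dropped: set is shared
    }
}
```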
I have also failed to get the path of the linkdb/segments/crawldb. Here is what I did: I implemented the setConf and getConf methods in the URLFilter, then passed the conf to LinkDbReader/CrawlDbReader/SegmentReader along with a path. There are many properties in the conf, like mapred.input.dir, but it sometimes points to a specific version of the linkdb and sometimes to the segments. I also tried to just hard-code the path, but that throws a NullPointerException. I am particularly interested in reading the parse_data in the segments, since I need the parsed metadata. Any thoughts on getting this to work?

Thanks,
Zhique

On Mon, Feb 23, 2015 at 12:56 PM, Jorge Luis Betancourt González <[email protected]> wrote:

> My two cents on the topic:
>
> The URLFilter family of plugins is handled by the URLFilters class. This
> class gets instantiated in several places in the source code, including
> the Fetcher and the Injector. The URLFilters class uses the
> PluginRepository.get() method to load the plugins; this method does
> indeed use a cache based on the UUID of the NutchConfiguration object
> passed as an argument. This generated UUID can be found inside the config
> object under the "nutch.conf.uuid" key. From what I can see in the
> NutchConfiguration class, each time the create() method is called a new
> instance of the Configuration class is created and a new UUID generated;
> the new UUID causes a cache miss, so a new PluginRepository is created
> and cached.
>
> ------------------------------
> *From: *"Renxia Wang" <[email protected]>
> *To: *[email protected]
> *Sent: *Monday, February 23, 2015 1:00:30 AM
> *Subject: *Re: How to read metadata/content of an URL in URLFilter?
>
> I log the instance id and get the result:
>
> 2015-02-22 21:42:15,972 INFO exactdup.ExactDupURLFilter - URlFilter ID: 423250256
> 2015-02-22 21:42:24,782 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
> 2015-02-22 21:42:24,795 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
> 2015-02-22 21:42:24,804 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
> ...
> 2015-02-22 21:42:25,039 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
> 2015-02-22 21:42:25,041 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
> 2015-02-22 21:42:28,282 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
> 2015-02-22 21:42:28,292 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
> ...
> 2015-02-22 21:42:28,487 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
> 2015-02-22 21:42:28,489 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
> 2015-02-22 21:42:43,984 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1818924295
> 2015-02-22 21:42:44,090 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1818924295
> ...
> 2015-02-22 21:42:53,404 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1818924295
> 2015-02-22 21:44:08,533 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
> 2015-02-22 21:44:08,544 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
> ...
> 2015-02-22 21:44:10,418 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
> 2015-02-22 21:44:10,420 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
> 2015-02-22 21:44:14,467 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
> 2015-02-22 21:44:14,478 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
> ...
> 2015-02-22 21:44:15,643 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
> 2015-02-22 21:44:15,644 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
> 2015-02-22 21:44:26,189 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1343455839
> 2015-02-22 21:44:28,501 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1343455839
> ...
> 2015-02-22 21:45:29,707 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1343455839
>
> As the url filters are called in the injector and the crawldb update, I grep:
>
> ➜ local git:(trunk) ✗ grep 'Injector: starting at' logs/hadoop.log
> 2015-02-22 21:42:14,896 INFO crawl.Injector - Injector: starting at 2015-02-22 21:42:14
>
> which means URlFilter ID 423250256 is the one created in the injector.
>
> ➜ local git:(trunk) ✗ grep 'CrawlDb update: starting at' logs/hadoop.log
> 2015-02-22 21:42:25,951 INFO crawl.CrawlDb - CrawlDb update: starting at 2015-02-22 21:42:25
> 2015-02-22 21:44:11,208 INFO crawl.CrawlDb - CrawlDb update: starting at 2015-02-22 21:44:11
>
> Here is where it gets confusing: there are six unique URLFilter ids after the
> injector, while there are only two crawldb updates.
>
> On Sun, Feb 22, 2015 at 9:24 PM, Mattmann, Chris A (3980) <[email protected]> wrote:
>
>> Cool, good test. I thought the Nutch plugin system cached instances
>> of plugins - I am not sure if it creates a new one each time. Are you
>> sure you don't have the same URLFilter instance, and it's just called on
>> different datasets and thus produces different counts?
>>
>> Either way, you should simply proceed with the filters in whatever
>> form they are working in (cached or not).
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>> -----Original Message-----
>> From: Renxia Wang <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Sunday, February 22, 2015 at 9:16 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: How to read metadata/content of an URL in URLFilter?
>>
>> >I just added a counter in my URLFilter, and proved that the URLFilter
>> >instances in each fetching cycle are different.
>> >
>> >Sample logs:
>> >2015-02-22 21:07:10,636 INFO exactdup.ExactDupURLFilter - Processed 69 links
>> >2015-02-22 21:07:10,638 INFO exactdup.ExactDupURLFilter - Processed 70 links
>> >2015-02-22 21:07:10,640 INFO exactdup.ExactDupURLFilter - Processed 71 links
>> >2015-02-22 21:07:10,641 INFO exactdup.ExactDupURLFilter - Processed 72 links
>> >2015-02-22 21:07:10,643 INFO exactdup.ExactDupURLFilter - Processed 73 links
>> >2015-02-22 21:07:10,645 INFO exactdup.ExactDupURLFilter - Processed 74 links
>> >2015-02-22 21:07:10,647 INFO exactdup.ExactDupURLFilter - Processed 75 links
>> >2015-02-22 21:07:10,649 INFO exactdup.ExactDupURLFilter - Processed 76 links
>> >2015-02-22 21:07:10,650 INFO exactdup.ExactDupURLFilter - Processed 77 links
>> >2015-02-22 21:07:13,835 INFO exactdup.ExactDupURLFilter - Processed 1 links
>> >2015-02-22 21:07:13,850 INFO exactdup.ExactDupURLFilter - Processed 2 links
>> >2015-02-22 21:07:13,865 INFO exactdup.ExactDupURLFilter - Processed 3 links
>> >2015-02-22 21:07:13,878 INFO exactdup.ExactDupURLFilter - Processed 4 links
>> >2015-02-22 21:07:13,889 INFO exactdup.ExactDupURLFilter - Processed 5 links
>> >2015-02-22 21:07:13,899 INFO exactdup.ExactDupURLFilter - Processed 6 links
>> >
>> >Not sure if it is configurable?
>> >
>> >On Sun, Feb 22, 2015 at 8:56 PM, Mattmann, Chris A (3980)
>> ><[email protected]> wrote:
>> >
>> >That's one way - for sure - but what I was implying is that
>> >you can train (read: feed data into) your model (read: algorithm)
>> >using previously crawled information. So, no, I wasn't implying
>> >machine learning.
>> >
>> >-----Original Message-----
>> >From: Renxia Wang <[email protected]>
>> >Reply-To: "[email protected]" <[email protected]>
>> >Date: Sunday, February 22, 2015 at 8:47 PM
>> >To: "[email protected]" <[email protected]>
>> >Subject: Re: How to read metadata/content of an URL in URLFilter?
>> >
>> >>Hi Prof Mattmann,
>> >>
>> >>You are saying "train" and "model" - are we expected to use machine
>> >>learning algorithms to train a model for duplication detection?
>> >>
>> >>Thanks,
>> >>
>> >>Renxia
>> >>
>> >>On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
>> >><[email protected]> wrote:
>> >>
>> >>There is nothing in your assignment stating that you can't
>> >>use *previously* crawled data to train your model - you
>> >>should have at least 2 full sets of this.
>> >>
>> >>Cheers,
>> >>Chris
>> >>
>> >>-----Original Message-----
>> >>From: Majisha Parambath <[email protected]>
>> >>Reply-To: "[email protected]" <[email protected]>
>> >>Date: Sunday, February 22, 2015 at 8:30 PM
>> >>To: dev <[email protected]>
>> >>Subject: Re: How to read metadata/content of an URL in URLFilter?
>> >>
>> >>>My understanding is that the LinkDB or CrawlDB will contain the results
>> >>>of previously fetched and parsed pages.
>> >>>
>> >>>However, if we want to get the contents of a URL/page in the URL
>> >>>filtering stage (where it is not yet fetched), is there any util in
>> >>>Nutch that we can use to fetch the contents of the page?
>> >>>
>> >>>Thanks and regards,
>> >>>Majisha Namath Parambath
>> >>>Graduate Student, M.S. in Computer Science
>> >>>Viterbi School of Engineering
>> >>>University of Southern California, Los Angeles
>> >>>
>> >>>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
>> >>><[email protected]> wrote:
>> >>>
>> >>>In the constructor of your URLFilter, why not consider passing
>> >>>in a NutchConfiguration object, and then reading the path to, e.g.,
>> >>>the LinkDb from the config? Then have a private member variable
>> >>>for the LinkDbReader (maybe statically initialized for efficiency)
>> >>>and use that in your interface method.
>> >>>
>> >>>Cheers,
>> >>>Chris
>> >>>
>> >>>-----Original Message-----
>> >>>From: Renxia Wang <[email protected]>
>> >>>Reply-To: "[email protected]" <[email protected]>
>> >>>Date: Sunday, February 22, 2015 at 3:36 PM
>> >>>To: "[email protected]" <[email protected]>
>> >>>Subject: How to read metadata/content of an URL in URLFilter?
>> >>>
>> >>>>Hi,
>> >>>>
>> >>>>I want to develop a URLFilter which takes a url, gets its metadata or
>> >>>>even the fetched content, then uses some duplicate detection algorithms
>> >>>>to determine if it is a duplicate of any url in the batch.
>> >>>>However, the only parameter passed into the URLFilter is the url. Is it
>> >>>>possible to get the data I want for that input url in the URLFilter?
>> >>>>
>> >>>>Thanks,
>> >>>>
>> >>>>Zhique
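P.S. To check my understanding of Jorge's explanation of the plugin caching, I wrote a toy model of it (the class and method names below are stand-ins I made up, not the real Nutch PluginRepository/NutchConfiguration code): because each call to create() stamps a fresh UUID into the new config, a lookup with a newly created config can never hit the cache, so a new repository is built every time.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Toy model of the caching behavior described in the thread: the cache is
// keyed on the config's "nutch.conf.uuid" value, and every freshly created
// config carries a brand-new UUID, so it is always a cache miss.
public class PluginCacheModel {
    static final Map<String, PluginCacheModel> CACHE = new HashMap<>();

    // Stand-in for NutchConfiguration.create(): new config, new UUID.
    static String createConf() {
        return UUID.randomUUID().toString();
    }

    // Stand-in for PluginRepository.get(conf): reuse on hit, build on miss.
    static PluginCacheModel get(String confUuid) {
        return CACHE.computeIfAbsent(confUuid, k -> new PluginCacheModel());
    }

    public static void main(String[] args) {
        String conf1 = createConf();
        String conf2 = createConf();                  // fresh UUID
        System.out.println(get(conf1) == get(conf1)); // true: same conf hits the cache
        System.out.println(get(conf1) == get(conf2)); // false: new conf, new repository
    }
}
```

If this model is right, it would explain the distinct URLFilter instance ids per job in my logs above.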

