I log the instance ID and get the result:

2015-02-22 21:42:15,972 INFO exactdup.ExactDupURLFilter - URlFilter ID: 423250256
2015-02-22 21:42:24,782 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
2015-02-22 21:42:24,795 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
2015-02-22 21:42:24,804 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
...
2015-02-22 21:42:25,039 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
2015-02-22 21:42:25,041 INFO exactdup.ExactDupURLFilter - URlFilter ID: 828433560
2015-02-22 21:42:28,282 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
2015-02-22 21:42:28,292 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
...
2015-02-22 21:42:28,487 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
2015-02-22 21:42:28,489 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1240209240
2015-02-22 21:42:43,984 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1818924295
2015-02-22 21:42:44,090 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1818924295
...
2015-02-22 21:42:53,404 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1818924295
2015-02-22 21:44:08,533 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
2015-02-22 21:44:08,544 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
...
2015-02-22 21:44:10,418 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
2015-02-22 21:44:10,420 INFO exactdup.ExactDupURLFilter - URlFilter ID: 908006650
2015-02-22 21:44:14,467 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
2015-02-22 21:44:14,478 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
...
2015-02-22 21:44:15,643 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
2015-02-22 21:44:15,644 INFO exactdup.ExactDupURLFilter - URlFilter ID: 619451848
2015-02-22 21:44:26,189 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1343455839
2015-02-22 21:44:28,501 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1343455839
...
2015-02-22 21:45:29,707 INFO exactdup.ExactDupURLFilter - URlFilter ID: 1343455839
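For reference, the ID in the log can be produced along these lines. This is a minimal, self-contained sketch, not the actual plugin source; `System.identityHashCode` returns a stable per-object value, so distinct IDs in the log imply distinct filter instances:

```java
public class InstanceIdDemo {

    /** Stand-in for the real plugin class; logs a per-instance ID. */
    static class ExactDupURLFilter {
        ExactDupURLFilter() {
            // The real plugin would go through its logging framework;
            // System.out stands in for log.info here.
            System.out.println("URlFilter ID: " + System.identityHashCode(this));
        }

        String filter(String url) {
            return url; // accept everything; dedup logic omitted
        }
    }

    public static void main(String[] args) {
        ExactDupURLFilter a = new ExactDupURLFilter();
        ExactDupURLFilter b = new ExactDupURLFilter();
        // Two constructions give two (almost certainly distinct) IDs,
        // matching the pattern of distinct IDs in the log above.
        System.out.println(System.identityHashCode(a) != System.identityHashCode(b));
    }
}
```

Note that `identityHashCode` is not guaranteed unique, but collisions are rare enough that repeated distinct IDs are solid evidence of separate instances.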
As the URL filters are called in the injector and the CrawlDb update, I grep:

➜ local git:(trunk) ✗ grep 'Injector: starting at' logs/hadoop.log
2015-02-22 21:42:14,896 INFO crawl.Injector - Injector: starting at *2015-02-22 21:42:14*

which means URlFilter ID: 423250256 is the one created in the injector.

➜ local git:(trunk) ✗ grep 'CrawlDb update: starting at' logs/hadoop.log
2015-02-22 21:42:25,951 INFO crawl.CrawlDb - CrawlDb update: starting at 2015-02-22 21:42:25
2015-02-22 21:44:11,208 INFO crawl.CrawlDb - CrawlDb update: starting at 2015-02-22 21:44:11

Here is what confuses me: there are 6 unique URLFilter IDs after the injector, while there are only two CrawlDb updates.

On Sun, Feb 22, 2015 at 9:24 PM, Mattmann, Chris A (3980) <[email protected]> wrote:

> Cool, good test. I thought the Nutch plugin system cached instances
> of plugins - I am not sure if it creates a new one each time. Are you
> sure you don’t have the same URLFilter instance, and it’s just called on
> different datasets and thus produces different counts?
>
> Either way, you should simply proceed with the filters in whatever
> form they are working in (cached or not).
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Renxia Wang <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Sunday, February 22, 2015 at 9:16 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: How to read metadata/content of an URL in URLFilter?
>
>I just added a counter in my URLFilter, and prove that the URLFilter
>instances in each fetching circle are different.
>
>Sample logs:
>2015-02-22 21:07:10,636 INFO exactdup.ExactDupURLFilter - Processed 69 links
>2015-02-22 21:07:10,638 INFO exactdup.ExactDupURLFilter - Processed 70 links
>2015-02-22 21:07:10,640 INFO exactdup.ExactDupURLFilter - Processed 71 links
>2015-02-22 21:07:10,641 INFO exactdup.ExactDupURLFilter - Processed 72 links
>2015-02-22 21:07:10,643 INFO exactdup.ExactDupURLFilter - Processed 73 links
>2015-02-22 21:07:10,645 INFO exactdup.ExactDupURLFilter - Processed 74 links
>2015-02-22 21:07:10,647 INFO exactdup.ExactDupURLFilter - Processed 75 links
>2015-02-22 21:07:10,649 INFO exactdup.ExactDupURLFilter - Processed 76 links
>2015-02-22 21:07:10,650 INFO exactdup.ExactDupURLFilter - Processed 77 links
>2015-02-22 21:07:13,835 INFO exactdup.ExactDupURLFilter - Processed 1 links
>2015-02-22 21:07:13,850 INFO exactdup.ExactDupURLFilter - Processed 2 links
>2015-02-22 21:07:13,865 INFO exactdup.ExactDupURLFilter - Processed 3 links
>2015-02-22 21:07:13,878 INFO exactdup.ExactDupURLFilter - Processed 4 links
>2015-02-22 21:07:13,889 INFO exactdup.ExactDupURLFilter - Processed 5 links
>2015-02-22 21:07:13,899 INFO exactdup.ExactDupURLFilter - Processed 6 links
>
>Not sure if it is configurable?
>
>On Sun, Feb 22, 2015 at 8:56 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>That’s one way - for sure - but what I was implying is that
>you can train (read: feed data into) your model (read: algorithm)
>using previously crawled information. So, no I wasn’t implying
>machine learning.
>
>-----Original Message-----
>From: Renxia Wang <[email protected]>
>Reply-To: "[email protected]" <[email protected]>
>Date: Sunday, February 22, 2015 at 8:47 PM
>To: "[email protected]" <[email protected]>
>Subject: Re: How to read metadata/content of an URL in URLFilter?
>
>>Hi Prof Mattmann,
>>
>>You are saying "train" and "model", are we expected to use machine
>>learning algorithms to train a model for duplication detection?
>>
>>Thanks,
>>
>>Renxia
>>
>>On Sun, Feb 22, 2015 at 8:39 PM, Mattmann, Chris A (3980)
>><[email protected]> wrote:
>>
>>There is nothing stating in your assignment that you can’t
>>use *previously* crawled data to train your model - you
>>should have at least 2 full sets of this.
>>
>>Cheers,
>>Chris
>>
>>-----Original Message-----
>>From: Majisha Parambath <[email protected]>
>>Reply-To: "[email protected]" <[email protected]>
>>Date: Sunday, February 22, 2015 at 8:30 PM
>>To: dev <[email protected]>
>>Subject: Re: How to read metadata/content of an URL in URLFilter?
>>
>>>My understanding is that the LinkDB or CrawlDB will contain the results
>>>of previously fetched and parsed pages.
>>>
>>>However, if we want to get the contents of a URL/page in the URL filtering
>>>stage (which is not yet fetched), is there any util in Nutch that we can use
>>>to fetch the contents of the page?
>>>
>>>Thanks and regards,
>>>Majisha Namath Parambath
>>>Graduate Student, M.S in Computer Science
>>>Viterbi School of Engineering
>>>University of Southern California, Los Angeles
>>>
>>>On Sun, Feb 22, 2015 at 4:53 PM, Mattmann, Chris A (3980)
>>><[email protected]> wrote:
>>>
>>>In the constructor of your URLFilter, why not consider passing
>>>in a NutchConfiguration object, and then reading the path to e.g.,
>>>the LinkDb from the config. Then have a private member variable
>>>for the LinkDbReader (maybe static initialized for efficiency)
>>>and use that in your interface method.
>>>
>>>Cheers,
>>>Chris
>>>
>>>-----Original Message-----
>>>From: Renxia Wang <[email protected]>
>>>Reply-To: "[email protected]" <[email protected]>
>>>Date: Sunday, February 22, 2015 at 3:36 PM
>>>To: "[email protected]" <[email protected]>
>>>Subject: How to read metadata/content of an URL in URLFilter?
>>>
>>>>Hi,
>>>>
>>>>I want to develop a URLFilter which takes a URL plus its metadata, or
>>>>even the fetched content, then uses some duplicate detection algorithms
>>>>to determine if it is a duplicate of any URL in the batch. However, the
>>>>only parameter passed into the URLFilter is the URL. Is it possible to
>>>>get the data I want for that input URL in the URLFilter?
>>>>
>>>>Thanks,
>>>>
>>>>Zhique
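The constructor pattern Chris suggests above (read a path from the configuration, cache one statically initialized reader across filter instances) can be sketched in plain Java. Everything here is a hypothetical stand-in: `LinkDbReaderStub`, `DedupFilter`, and the `linkdb.path` key are illustrations only, not Nutch's actual API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class CachedReaderFilterDemo {

    /** Stand-in for a LinkDb reader; counts how many times it is opened. */
    static class LinkDbReaderStub {
        static final AtomicInteger OPEN_COUNT = new AtomicInteger();
        final String path;

        LinkDbReaderStub(String path) {
            this.path = path;
            OPEN_COUNT.incrementAndGet(); // "opening" the db
        }

        boolean contains(String url) {
            return false; // a real reader would consult the LinkDb
        }
    }

    /** Stand-in for a URLFilter: reads the db path from config once. */
    static class DedupFilter {
        // Static member: initialized on first use and shared by every
        // instance, so repeated plugin instantiation stays cheap even
        // if the plugin system creates a new filter per job phase.
        private static volatile LinkDbReaderStub reader;

        DedupFilter(Map<String, String> conf) {
            if (reader == null) {
                synchronized (DedupFilter.class) {
                    if (reader == null) {
                        reader = new LinkDbReaderStub(conf.get("linkdb.path"));
                    }
                }
            }
        }

        String filter(String url) {
            return reader.contains(url) ? null : url; // null = reject
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("linkdb.path", "crawl/linkdb");
        new DedupFilter(conf);
        new DedupFilter(conf);
        // Two filter instances, but the reader was opened only once.
        System.out.println("opens: " + LinkDbReaderStub.OPEN_COUNT.get());
    }
}
```

The double-checked locking with a `volatile` field is one safe way to do the "static initialized for efficiency" part when filters may be constructed from multiple threads; a static initializer block would also work if the configuration is available at class-load time.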

