[Nutch-dev] [Fwd: Re: get CrawlDatum]

Uroš Gruber Wed, 06 Sep 2006 10:47:06 -0700

A while ago I posted this on the dev list but got no reply. I wonder if 
this is the right approach and whether I should continue building this feature.
Do you think this idea would help Nutch, or is it a dead end that 
you've already discussed?

regards

Uros

Andrzej Bialecki wrote:
> Uroš Gruber wrote:
>> ParseData.metadata sounds nice, but I think I'm lost again :)
>> If I understand code flow the best place would be in Fetcher [262]
>>
>> but i'm not sure that datum holds info of url being fetched
>
> On the input to the fetcher you get a URL and a CrawlDatum (originally 
> coming from the crawldb). Check for example how the segment name is 
> passed around in metadata, you can use the same method.
>
Hi,

I made a draft patch, but I still see some problems. I know the 
code needs to be cleaned up and tested. Right now I don't know what 
hop number to assign to external URLs. For internal links it works great.

Here is the whole idea behind these changes.

Injected URLs always get hop 0. While fetching/updating/generating, the hop 
value is incremented by 1 (I still have no idea what to do with external 
links). Then I can add a config value such as max_hop to stop the fetcher 
and generator from creating more URLs.

This way it's possible to limit crawling vertically.
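
A minimal sketch of the hop idea, in plain Java with no Nutch classes. It is
not the draft patch. The class name HopLimitSketch, the methods inject and
addOutlinks, the in-memory map standing in for the crawldb, and the rule of
keeping the smaller hop when a URL is reached twice are all assumptions made
for illustration. External links are treated like any other outlink here,
since that question is still open.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: a map of URL -> hop stands in for the crawldb.
class HopLimitSketch {

    // Injected URLs always start at hop 0.
    static void inject(Map<String, Integer> hops, String url) {
        hops.put(url, 0);
    }

    // Record the outlinks of a fetched page at parent hop + 1,
    // skipping them entirely once the limit (max_hop) would be exceeded.
    static void addOutlinks(Map<String, Integer> hops, String parentUrl,
                            List<String> outlinks, int maxHop) {
        Integer parentHop = hops.get(parentUrl);
        if (parentHop == null) {
            return; // parent was never injected or discovered
        }
        int childHop = parentHop + 1;
        if (childHop > maxHop) {
            return; // limit reached: create no more URLs from this page
        }
        for (String link : outlinks) {
            // Assumption: if a URL is already known, keep the smaller hop.
            hops.merge(link, childHop, Integer::min);
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> hops = new HashMap<>();
        inject(hops, "http://example.com/");
        addOutlinks(hops, "http://example.com/",
                    List.of("http://example.com/a"), 1);
        addOutlinks(hops, "http://example.com/a",
                    List.of("http://example.com/b"), 1);
        System.out.println(hops);
    }
}
```

With max_hop set to 1, the page at hop 1 contributes no new URLs, so the
crawl stops one level below the injected URL.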

Comments are welcome.

