A while ago I posted this on the dev list but got no reply. I wonder whether this is the right approach and whether I should continue working on this feature. Do you think this idea would help Nutch, or is it a dead end that you have already discussed?

regards

Uros

Andrzej Bialecki wrote:
Uroš Gruber wrote:
ParseData.metadata sounds nice, but I think I'm lost again :)
If I understand the code flow, the best place would be in Fetcher [262],

but I'm not sure that the datum holds info about the URL being fetched.

On the input to the fetcher you get a URL and a CrawlDatum (originally coming from the crawldb). Check for example how the segment name is passed around in metadata, you can use the same method.

Hi,

I made a draft patch, but I still see some problems. I know the code needs to be cleaned up and tested. Right now I don't know what hop number to set for external URLs. For internal links it works great.

Here is the whole idea behind these changes.

Injected URLs always get hop 0. While fetching/updating/generating, the hop value is incremented by 1 (I still have no idea what to do with external links). Then I can add a config value such as max_hop to stop the fetcher and generator from creating more URLs.

This way it's possible to limit crawling vertically, i.e. by depth.
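To make the idea concrete, here is a rough standalone sketch of the hop bookkeeping. It is not the patch itself and does not use any Nutch classes: a plain Map stands in for the per-URL metadata, and the names HopSketch, HOP_KEY, hopOf, outlinkMetadata and shouldGenerate are made up for illustration. How external links should be numbered is still open, so the sketch treats every outlink the same way.

```java
import java.util.HashMap;
import java.util.Map;

public class HopSketch {
    // Hypothetical metadata key; not an existing Nutch constant.
    static final String HOP_KEY = "hop";

    /** Hop count stored in a URL's metadata; injected URLs carry none and count as hop 0. */
    static int hopOf(Map<String, String> metadata) {
        String value = metadata.get(HOP_KEY);
        return value == null ? 0 : Integer.parseInt(value);
    }

    /** Metadata for an outlink found on a page: the parent's hop plus one. */
    static Map<String, String> outlinkMetadata(Map<String, String> parent) {
        Map<String, String> child = new HashMap<>();
        child.put(HOP_KEY, Integer.toString(hopOf(parent) + 1));
        return child;
    }

    /** Generator-side check: only URLs at or below maxHop are selected for fetching. */
    static boolean shouldGenerate(Map<String, String> metadata, int maxHop) {
        return hopOf(metadata) <= maxHop;
    }

    public static void main(String[] args) {
        int maxHop = 2; // stands in for the proposed max_hop config value
        Map<String, String> page = new HashMap<>(); // an injected URL, hop 0
        for (int i = 0; i < 4; i++) {
            System.out.println("hop " + hopOf(page)
                    + " generate=" + shouldGenerate(page, maxHop));
            page = outlinkMetadata(page); // follow one link deeper
        }
    }
}
```

With maxHop set to 2, the loop accepts hops 0, 1 and 2 and rejects hop 3, which is the depth limit described above.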

Comments are welcome.


