A while ago I posted this on the dev list but got no reply. I wonder whether
this is the right approach and whether I should continue working on this feature.
Do you think this idea would help Nutch, or is it a dead end that
you have already discussed?
regards
Uros
Andrzej Bialecki wrote:
Uroš Gruber wrote:
ParseData.metadata sounds nice, but I think I'm lost again :)
If I understand the code flow, the best place would be in Fetcher [262],
but I'm not sure that datum holds info about the URL being fetched.
On the input to the fetcher you get a URL and a CrawlDatum (originally
coming from the crawldb). Check for example how the segment name is
passed around in metadata, you can use the same method.
Hi,
I made a draft patch, but I still see some problems. I know the
code needs to be cleaned up and tested. Right now I don't know what
hop number to assign to external URLs. For internal links it works great.
The whole idea of these changes is as follows.
Injected URLs always get hop 0. During fetching/updating/generating, the hop
value is incremented by 1 (I still have no idea what to do with external
links). Then I can add a config value such as max_hop to stop the fetcher and
generator from creating more URLs beyond that limit.
This way it's possible to limit crawling vertically.
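
To make the idea concrete, here is a minimal, self-contained sketch of the hop bookkeeping in plain Java. It is not the patch and it does not use Nutch classes: a plain Map stands in for the datum metadata, and the names HopLimitSketch, HOP_KEY, injectedMeta, outlinkMeta and withinLimit are made up for illustration. It also assumes every outlink simply gets parent hop + 1, so the open question about external links is not addressed.

```java
import java.util.HashMap;
import java.util.Map;

class HopLimitSketch {
    // Hypothetical metadata key; not an existing Nutch constant.
    static final String HOP_KEY = "hop";

    // Injected (seed) URLs always start at hop 0.
    static Map<String, String> injectedMeta() {
        Map<String, String> meta = new HashMap<>();
        meta.put(HOP_KEY, "0");
        return meta;
    }

    // Read the hop count; a missing value is treated as 0 (an assumption).
    static int hopOf(Map<String, String> meta) {
        String value = meta.get(HOP_KEY);
        return value == null ? 0 : Integer.parseInt(value);
    }

    // Metadata for a newly discovered outlink: parent's hop + 1.
    static Map<String, String> outlinkMeta(Map<String, String> parentMeta) {
        Map<String, String> meta = new HashMap<>();
        meta.put(HOP_KEY, Integer.toString(hopOf(parentMeta) + 1));
        return meta;
    }

    // Check a generator could apply before selecting a URL for fetching.
    static boolean withinLimit(Map<String, String> meta, int maxHop) {
        return hopOf(meta) <= maxHop;
    }

    public static void main(String[] args) {
        int maxHop = 1; // stands in for the proposed max_hop config value
        Map<String, String> seed = injectedMeta();
        Map<String, String> child = outlinkMeta(seed);
        Map<String, String> grandchild = outlinkMeta(child);
        System.out.println(withinLimit(seed, maxHop));       // true
        System.out.println(withinLimit(child, maxHop));      // true
        System.out.println(withinLimit(grandchild, maxHop)); // false
    }
}
```

With max_hop set to 1, the seed and its direct outlinks would still be generated, while links found two hops away would be dropped.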
Comments are welcome.