Uroš Gruber wrote:
I made a draft patch, but I still see some problems. I know the code
needs to be cleaned up and tested. Right now I don't know what number
to set for external URLs. For internal links it works great.
(the patch changes CrawlDatum itself, I think it would be better to put
the hop counter in CrawlDatum.metaData.)
The whole idea of these changes is as follows.
Injected URLs always get hop 0. During fetching/updating/generating the
hop value is incremented by 1 (I still have no idea what to do with
external links). Then I can add a config value such as max_hop to stop
the fetcher and generator from creating more URLs.
This way it's possible to limit crawling vertically, i.e. by depth.
Comments are welcome.
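
The hop scheme described above could be sketched roughly as follows.
This is not the patch itself: it uses a plain Map as a stand-in for
per-URL metadata, and the key name "hop", the helper names, and the
inclusive comparison against max_hop are all assumptions made for
illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class HopCounterSketch {
    // Hypothetical metadata key; not a name taken from the patch.
    static final String HOP_KEY = "hop";

    // Injected (seed) URLs always start at hop 0.
    static Map<String, String> inject() {
        Map<String, String> meta = new HashMap<>();
        meta.put(HOP_KEY, "0");
        return meta;
    }

    // Read the hop value; a missing value is treated as 0 (assumption).
    static int hopOf(Map<String, String> meta) {
        return Integer.parseInt(meta.getOrDefault(HOP_KEY, "0"));
    }

    // An outlink discovered on a page is one hop further than that page.
    static Map<String, String> outlink(Map<String, String> parentMeta) {
        Map<String, String> meta = new HashMap<>();
        meta.put(HOP_KEY, Integer.toString(hopOf(parentMeta) + 1));
        return meta;
    }

    // Check a generator or fetcher could apply: skip URLs beyond maxHop.
    // Treating hop == maxHop as still allowed is an assumption.
    static boolean withinLimit(Map<String, String> meta, int maxHop) {
        return hopOf(meta) <= maxHop;
    }

    public static void main(String[] args) {
        int maxHop = 2;
        Map<String, String> page = inject();
        // Follow a chain of internal links and show where the limit cuts in.
        for (int i = 0; i < 4; i++) {
            System.out.println("hop=" + hopOf(page)
                    + " fetch=" + withinLimit(page, maxHop));
            page = outlink(page);
        }
    }
}
```

With max_hop set to 2, the chain above is followed for hops 0, 1 and
2, and the URL at hop 3 is skipped.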
Well, it really depends on what you want to do when you encounter an
external link. Do you want to restart the counter, i.e. crawl the new
site at full depth up to max_hop? Then set hop=0. Do you want to
terminate the crawl at that link? Then set hop=max_hop.
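
The two choices can be written down as a small policy function. This
is only a sketch under assumptions: "external" is taken here to mean
that the link's host differs from the parent page's host, and the
class, enum and method names are hypothetical, not part of Nutch.

```java
import java.net.URI;

public class ExternalLinkPolicySketch {
    // RESTART: crawl the new site at full depth (hop = 0).
    // TERMINATE: stop following links from the external page (hop = maxHop).
    enum Policy { RESTART, TERMINATE }

    // Assumption: a link is external when its host differs from the parent's.
    static boolean isExternal(String parentUrl, String linkUrl) {
        String parentHost = URI.create(parentUrl).getHost();
        String linkHost = URI.create(linkUrl).getHost();
        return parentHost == null || !parentHost.equalsIgnoreCase(linkHost);
    }

    // Hop value to assign to a newly discovered link.
    static int nextHop(String parentUrl, String linkUrl,
                       int parentHop, int maxHop, Policy policy) {
        if (!isExternal(parentUrl, linkUrl)) {
            return parentHop + 1; // internal link: one hop deeper
        }
        return policy == Policy.RESTART ? 0 : maxHop;
    }

    public static void main(String[] args) {
        String parent = "http://a.example/index.html";
        System.out.println(nextHop(parent, "http://a.example/about.html",
                1, 3, Policy.RESTART));   // internal link
        System.out.println(nextHop(parent, "http://b.example/",
                1, 3, Policy.RESTART));   // external, counter restarted
        System.out.println(nextHop(parent, "http://b.example/",
                1, 3, Policy.TERMINATE)); // external, crawl ends here
    }
}
```

With hop=max_hop, whether the external page itself is still fetched
depends on whether the limit check is inclusive. Either way, its own
outlinks would exceed max_hop and be dropped.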
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com