Uroš Gruber wrote:
I made a draft patch, but there are still some problems I see. I know the code needs to be cleaned up and tested. Right now I don't know what hop number to set for external URLs. For internal links it works great.

(The patch changes CrawlDatum itself; I think it would be better to put the hop counter in CrawlDatum.metaData.)


What is the whole idea of these changes?

Injected URLs always get hop 0. While fetching/updating/generating, the hop value is incremented by 1 (I still have no idea what to do with external links). Then I can add a config value such as max_hop to stop the fetcher and generator from creating more URLs.

This way it's possible to limit crawling vertically.

Comments are welcome.

Well, it really depends on what you want to do when you encounter an external link. Do you want to restart the counter, i.e. crawl the new site at full depth up to max_hop? Then set hop=0. Do you want to terminate the crawl at that link? Then set hop=max_hop.
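
To make the two choices concrete, here is a minimal sketch in plain Java. It is not taken from the patch and does not use any Nutch API: the class HopPolicy, the enum ExternalLinkMode and the methods nextHop and mayExpand are hypothetical names. It assumes that "external" means the link's host differs from the host of the page it was found on, and that a page whose hop has reached max_hop is still recorded but contributes no new URLs.

```java
import java.net.URI;

/** Hypothetical helper, not part of Nutch: picks the hop value for an outlink. */
final class HopPolicy {
    /** What to do when a link leaves the current host. */
    enum ExternalLinkMode { RESTART, TERMINATE }

    private final int maxHop;
    private final ExternalLinkMode mode;

    HopPolicy(int maxHop, ExternalLinkMode mode) {
        this.maxHop = maxHop;
        this.mode = mode;
    }

    /**
     * Hop value for toUrl, found on fromUrl whose own hop is parentHop.
     * Internal links get parentHop + 1; external links get 0 (RESTART)
     * or maxHop (TERMINATE). Throws IllegalArgumentException on URLs
     * that java.net.URI cannot parse.
     */
    int nextHop(String fromUrl, String toUrl, int parentHop) {
        String fromHost = URI.create(fromUrl).getHost();
        String toHost = URI.create(toUrl).getHost();
        // Assumption: a missing or different host counts as external.
        boolean external = fromHost == null || !fromHost.equalsIgnoreCase(toHost);
        if (!external) {
            return parentHop + 1;
        }
        return mode == ExternalLinkMode.RESTART ? 0 : maxHop;
    }

    /** Assumption: outlinks are generated only for pages below maxHop. */
    boolean mayExpand(int hop) {
        return hop < maxHop;
    }
}
```

With max_hop=3, RESTART gives an external link hop 0, so the new site is crawled to full depth, while TERMINATE gives it hop 3, so mayExpand returns false and nothing beyond that link is generated.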

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

