Uroš Gruber wrote:
> I made a draft patch, but I still see some problems. I
> know the code needs to be cleaned up and tested. Right now I don't know
> what number to set for external URLs. For internal links it works great.
(The patch changes CrawlDatum itself; I think it would be better to put the hop counter in CrawlDatum.metaData.)

> What is the whole idea of these changes.
>
> Injected URLs always get hop 0. While fetching/updating/generating, the hop
> value is incremented by 1 (still no idea what to do with external
> links). Then I can add a config value, max_hop etc., to stop the fetcher and
> generator from creating more URLs.
>
> This way it's possible to limit crawling vertically.
>
> Comments are welcome.

Well, it really depends on what you want to do when you encounter an external link. Do you want to restart the counter, i.e. crawl the new site at full depth up to max_hop? Then set hop=0. Do you want to terminate the crawl at that link? Then set hop=max_hop.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
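
To make the two options for external links concrete, here is a minimal sketch in plain Python. It is not Nutch code and does not touch CrawlDatum; the names `next_hop`, `may_expand`, and `external_policy` are hypothetical, and the rule "expand a page only while hop < max_hop" is an assumption about how max_hop would be enforced.

```python
from urllib.parse import urlparse

MAX_HOP = 3  # hypothetical stand-in for the proposed max_hop config value


def next_hop(parent_url, parent_hop, link_url,
             external_policy="restart", max_hop=MAX_HOP):
    """Return the hop value to store for a link found on parent_url.

    Internal links (same host) get parent_hop + 1.
    External links follow one of the two policies from the reply:
      "restart"   -> hop = 0        (crawl the new site at full depth)
      "terminate" -> hop = max_hop  (stop the crawl at that link)
    """
    same_host = urlparse(parent_url).hostname == urlparse(link_url).hostname
    if same_host:
        return parent_hop + 1
    if external_policy == "restart":
        return 0
    if external_policy == "terminate":
        return max_hop
    raise ValueError("unknown external_policy: %r" % external_policy)


def may_expand(hop, max_hop=MAX_HOP):
    """Assumed limit check: add a page's outlinks only while hop < max_hop."""
    return hop < max_hop


# Injected URLs start at hop 0.
seed = "http://a.example/"
internal = next_hop(seed, 0, "http://a.example/page")
external_restart = next_hop(seed, 2, "http://b.example/", "restart")
external_stop = next_hop(seed, 2, "http://b.example/", "terminate")
```

With "terminate", the external page itself still carries a valid hop value, but `may_expand` is false for it, so none of its outlinks would be generated.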
