Uroš Gruber wrote:
> I made a draft patch, but I still see some problems. I know the code 
> needs to be cleaned up and tested, but right now I don't know what 
> hop number to set for external URLs. Internal linking works great.

(The patch changes CrawlDatum itself; I think it would be better to put 
the hop counter in CrawlDatum.metaData.)

>
> The whole idea of these changes is as follows.
>
> Injected URLs always get hop 0. During fetching/updating/generating, 
> the hop value is incremented by 1 (I still have no idea what to do 
> with external links). Then I can add a config value such as max_hop 
> to stop the fetcher and generator from creating more URLs.
>
> This way it's possible to limit crawling vertically.
>
> Comments are welcome.

Well, it really depends on what you want to do when you encounter an 
external link. Do you want to restart the counter, i.e. crawl the new 
site at full depth up to max_hop? Then set hop=0. Do you want to 
terminate the crawl at that link? Then set hop=max_hop.
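
To make the two choices concrete, here is a rough sketch in plain Java. 
It is not the patch and does not use any Nutch classes: HopPolicy, 
ExternalLinkMode, outlinkHop and mayExpand are names made up for this 
illustration, "external" is assumed to mean "different host", and I 
assume a page's outlinks are followed only while hop < max_hop.

```java
import java.net.URI;

// Illustrative only: none of these names exist in Nutch.
final class HopPolicy {

    // The two ways of treating an external link discussed above.
    enum ExternalLinkMode { RESTART, TERMINATE }

    // Hop value to assign to an outlink found on a page with hop parentHop.
    static int outlinkHop(int parentHop, String fromUrl, String toUrl,
                          int maxHop, ExternalLinkMode mode) {
        String fromHost = URI.create(fromUrl).getHost();
        String toHost = URI.create(toUrl).getHost();
        // Assumption: a link is external when the hosts differ
        // (or when either host cannot be determined).
        boolean external = fromHost == null || !fromHost.equalsIgnoreCase(toHost);
        if (external) {
            // RESTART: crawl the new site at full depth (hop = 0).
            // TERMINATE: stop at this link (hop = maxHop).
            return mode == ExternalLinkMode.RESTART ? 0 : maxHop;
        }
        // Internal link: one hop deeper than the parent page.
        return parentHop + 1;
    }

    // Assumption: outlinks are generated only for pages below maxHop.
    static boolean mayExpand(int hop, int maxHop) {
        return hop < maxHop;
    }

    public static void main(String[] args) {
        int maxHop = 3;
        int internal = outlinkHop(1, "http://a.example/x", "http://a.example/y",
                                  maxHop, ExternalLinkMode.TERMINATE);
        int external = outlinkHop(1, "http://a.example/x", "http://b.example/",
                                  maxHop, ExternalLinkMode.TERMINATE);
        System.out.println("internal hop=" + internal
                + " expand=" + mayExpand(internal, maxHop));
        System.out.println("external hop=" + external
                + " expand=" + mayExpand(external, maxHop));
    }
}
```

With TERMINATE the external page itself can still be fetched, but 
nothing beyond it is generated; with RESTART it is treated like a 
freshly injected URL.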

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
