eric park wrote:
Hello. I tried to crawl a certain site using both nutch 0.6 and nutch 0.7,
just to compare how they differ.

However, I get fewer URLs crawled using nutch 0.7 than nutch 0.6. I'll paste 2
different log files below.

As you can see below, both 0.6 and 0.7 fetch the same number of URLs in the
first depth, but in the second depth, nutch 0.7 fetches only 15 URLs while
nutch 0.6 fetches 34 URLs. Of course, the configuration and settings
are the same.

IIRC (it was long ago...) version 0.6 had a bug where unwanted URLs would slip through the URLFilters. This was tightened in 0.7. Please check whether the URLs that are rejected in 0.7 are really valid URLs, i.e. whether they should be accepted at all.
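
One rough way to run that check outside of Nutch is to replay the URLs that only 0.6 fetched against your filter rules. The sketch below is a standalone Python emulation, not Nutch code. It assumes the usual regex URL filter convention: each rule is a regex prefixed with '+' (accept) or '-' (reject), the first matching rule decides, and a URL that matches no rule is rejected. The rules, the example.com host and the function names are all hypothetical, and Python's regex syntax may differ in details from the engine Nutch uses.

```python
import re

# Hypothetical rules, written in the style of a regex URL filter file.
# '+' accepts, '-' rejects; lines starting with '#' are comments.
RULES = r"""
# skip URLs containing characters that usually indicate queries
-[?*!@=]
# accept anything on the (hypothetical) example.com domain
+^http://([a-z0-9]*\.)*example\.com/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into a list of (accept, compiled_regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Apply the assumed first-match-wins semantics to a single URL."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # assumption: a URL that matches no rule is rejected


rules = parse_rules(RULES)
for url in [
    "http://www.example.com/page.html",
    "http://www.example.com/page?id=1",
    "http://other.org/",
]:
    print("ACCEPT" if accepts(rules, url) else "REJECT", url)
```

If a URL from the 0.6 log comes out as REJECT here, then the stricter filtering in 0.7 is a likely explanation for the difference, and the rule that rejects it is the thing to review.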

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

