eric park wrote:
Hello. I tried to crawl a certain site using both Nutch 0.6 and Nutch 0.7,
just to compare how they differ.
However, I get fewer URLs crawled using Nutch 0.7 than Nutch 0.6. I'll paste 2
different log files below.
As you can see below, both 0.6 and 0.7 fetch the same number of URLs in the
first depth, but in the second depth, Nutch 0.7 fetches only 15 URLs while
Nutch 0.6 fetches 34 URLs. Of course, the configuration and settings
are the same.
IIRC (it was long ago...) version 0.6 had a bug where unwanted URLs
would slip through the URLFilters. This was tightened in 0.7. Please
check whether the URLs that are rejected in 0.7 are really valid URLs, i.e.
whether they should have been accepted in the first place.
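
One way to do that check is to diff the URL lists from the two logs and run the
missing URLs through the filter rules by hand. The sketch below is only an
illustration, not Nutch code: it assumes the rules are written as "+regex" /
"-regex" lines where the first matching rule decides and an unmatched URL is
rejected, and the rule set, the URLs and the helper names (parse_rules,
accepts, only_in_first) are all hypothetical. Substitute your own filter file
and the URLs taken from your logs.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blanks and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """The first rule whose regex is found in the URL decides the outcome.

    A URL that matches no rule is rejected (an assumption of this sketch).
    """
    for allow, regex in rules:
        if regex.search(url):
            return allow
    return False


def only_in_first(urls_a, urls_b):
    """Return the URLs of urls_a that do not appear in urls_b, in order."""
    seen = set(urls_b)
    return [u for u in urls_a if u not in seen]


# Hypothetical rule set, for illustration only.
RULES = parse_rules(r"""
# skip URLs containing characters typical of queries
-[?*!@=]
# accept anything under example.com
+^http://([a-z0-9]*\.)*example\.com/
# reject everything else
-.
""")

# Hypothetical URL lists standing in for what each version fetched.
fetched_06 = [
    "http://www.example.com/a.html",
    "http://www.example.com/b.html?id=1",
    "http://other.org/c.html",
]
fetched_07 = ["http://www.example.com/a.html"]

missing = only_in_first(fetched_06, fetched_07)
for url in missing:
    verdict = "accepted" if accepts(RULES, url) else "rejected"
    print(verdict, url)
```

If every URL in the "missing" list comes out as rejected by your own rules, the
smaller 0.7 crawl is most likely the filters doing their job rather than a
regression.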
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com