AJ wrote:
I tried to run 10 cycles of fetch/updatedb. In the 3rd cycle, the fetch list had 8810 URLs. Fetch ran pretty fast on my laptop until about 4000 pages had been fetched. After 4000 pages, it suddenly became very slow, about 30 minutes for just 100 pages, and my laptop started running at 100% CPU the whole time. Is there a threshold for fetch list size above which fetch performance degrades? Or was it just my laptop? I know the "-topN" option can control the fetch list size, but topN=4000 seems too small because it would end up producing thousands of segments. Is there a good rule of thumb for the topN setting?
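(As a reference point, here is a minimal sketch of one generate/fetch/updatedb cycle with -topN; the command layout and the "db"/"segments" paths below assume a Nutch 0.7-style local setup, so adjust them for your installation:)

    # cap the fetch list at the 4000 top-scoring URLs (value is illustrative)
    bin/nutch generate db segments -topN 4000
    # pick up the segment directory that generate just created
    s=`ls -d segments/2* | tail -1`
    bin/nutch fetch $s
    # fold the fetch results back into the web db
    bin/nutch updatedb db $s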

A related question is how big a segment should be in order to keep the number of segments small without hurting fetch performance too much. For example, to crawl 1 million pages in one run (over many fetch cycles), what would be a good limit for each fetch list?

There are no artificial limits like that - I'm routinely fetching segments of 1 mln pages. Most likely what happened to you is this:

* you are using a Nutch version that bundles PDFBox 0.7.1 or below,

* you fetched a rare kind of PDF that puts PDFBox into a tight loop,

* the thread that got stuck is consuming 99% of your CPU. :-)
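You can usually confirm this with a thread dump. A minimal sketch, assuming a HotSpot JVM on a Unix-like system (the pid lookup and package name are illustrative):

    # find the JVM process running the fetcher
    ps ax | grep [n]utch
    # SIGQUIT makes the JVM print a thread dump to its stdout/log
    kill -QUIT <pid>
    # a stuck parser shows up as a runnable thread spinning in org.pdfbox.* frames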

Solution: upgrade PDFBox to the as-yet-unreleased 0.7.2.
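Until then, a possible workaround is to exclude PDFs from fetching via the URL filter. A sketch, assuming the regex URL filter is active (whether that is conf/regex-urlfilter.txt or conf/crawl-urlfilter.txt depends on how you invoke Nutch):

    # reject any URL ending in .pdf; put this before the catch-all accept rule
    -\.pdf$

You lose the PDF content, but no fetcher thread can get stuck in the parser.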


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
