AJ wrote:
I tried to run 10 cycles of fetch/updatedb. In the 3rd cycle, the fetch
list had 8810 URLs. Fetch ran pretty fast on my laptop until about 4000
pages had been fetched. After 4000 pages, it suddenly slowed way down,
to about 30 minutes for just 100 pages, and my laptop started running
at 100% CPU the whole time. Is there a threshold for fetch list size
above which fetch performance degrades? Or is it just my laptop? I know
the "-topN" option can control the fetch size, but topN=4000 seems too
small because it would end up with thousands of segments. Is there a
good rule of thumb for the topN setting?
A related question: how big should a segment be in order to keep the
number of segments small without hurting fetch performance too much? For
example, to crawl 1 million pages in one run (across many fetch cycles),
what would be a good limit for each fetch list?
There are no artificial limits like that - I'm routinely fetching
segments of 1 mln pages. Most likely what happened to you is that:
* you are using a Nutch version with PDFBox 0.7.1 or below
* you fetched a rare kind of PDF, which puts PDFBox into a tight loop
* the thread that got stuck is consuming 99% of your CPU. :-)
Solution: upgrade PDFBox to the as-yet-unreleased 0.7.2.
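If you want to confirm that diagnosis before upgrading, you can match the
hot thread against a JVM stack dump. A rough sketch, assuming a JDK that
ships jstack (on older JVMs, `kill -QUIT <pid>` prints the dump to the
JVM's stdout instead); the PID and TID values here are made-up examples:

```shell
# Hypothetical PID of the fetcher JVM; substitute your own (e.g. from jps).
PID=12345

# 1. Find the hottest thread: 'top -H' shows per-thread CPU usage.
#    Note the decimal TID of the thread pinned at ~99%.
#      top -H -p "$PID"

# 2. jstack reports native thread ids in hex ("nid=0x..."), so convert
#    the decimal TID from top to hex:
TID=4321                       # example TID taken from top
NID=$(printf '0x%x' "$TID")
echo "$NID"                    # -> 0x10e1

# 3. Dump the stacks and look at the busy thread; a PDFBox tight loop
#    shows up as frames in org.pdfbox.* :
#      jstack "$PID" | grep -A 20 "nid=$NID"
```

If the stuck thread's stack is inside the PDF parser, that's the PDFBox
bug described above rather than anything to do with fetch list size.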
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com