Tomislav Poljak wrote:
Hi,
so I have dedicated 1000 MB (-Xmx1000m) to the Nutch java process when
fetching (default settings). When using 10 threads I can fetch 25000
URLs, but when using 20 threads the fetcher fails with
java.lang.OutOfMemoryError: Java heap space, even on a 15000-URL
fetchlist. Are 20 threads too much for -Xmx1000m, or is something else
wrong? What would be the recommended settings (number of threads, how much
RAM) for fetching a list of 100k URLs with the best performance?

I routinely run crawls with 100 threads or more. If you're running the fetcher in parsing mode (i.e. it not only fetches but also parses the content), then your problem is likely the memory consumption of a parsing plugin (such as the PDF or MS Office parsers).

I suggest running the fetcher in non-parsing mode (the -noParsing cmd-line option), and then parsing the segment in a separate step (bin/nutch parse).
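For example, the two-step workflow could look like the sketch below. The segment path and thread count are illustrative, not from the original post; adjust them to your crawl directory layout:

```shell
# Step 1: fetch only -- parsing plugins are never loaded,
# which keeps heap usage per thread low.
bin/nutch fetch crawl/segments/20070101120000 -threads 20 -noParsing

# Step 2: parse the fetched content in a separate JVM run,
# so a memory-hungry parser cannot kill the fetch.
bin/nutch parse crawl/segments/20070101120000
```

Equivalently, setting the fetcher.parse property to false in conf/nutch-site.xml should make non-parsing mode the default without needing the command-line flag.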


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com