Tomislav Poljak wrote:
Hi,
so I have dedicated 1000 MB (-Xmx1000m) to the Nutch java process when
fetching (default settings). When using 10 threads I can fetch 25000
URLs, but when using 20 threads the fetcher fails with
java.lang.OutOfMemoryError: Java heap space, even on a 15000-URL
fetchlist. Are 20 threads too much for -Xmx1000m, or is something else
wrong? What would be the recommended settings (number of threads, how much
RAM) for fetching a list of 100k URLs with the best performance?

I routinely run crawls with 100 threads or more. If you're running the fetcher in parsing mode (i.e. it not only fetches but also parses the content), then your problem is likely the memory consumption of a parsing plugin (such as the PDF or MS Office parsers).

I suggest running the fetcher in non-parsing mode (the -noParsing cmd-line option), and then parsing the segment in a separate step (bin/nutch parse).
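For example, the two-step workflow could look like the sketch below. The segment path and thread count are illustrative, not from the original post; adjust them to your crawl directory layout:

```shell
# Step 1: fetch only -- parsing plugins are never loaded,
# which keeps heap usage per thread low.
bin/nutch fetch crawl/segments/20070101120000 -threads 20 -noParsing

# Step 2: parse the fetched content in a separate JVM run,
# so a memory-hungry parser cannot kill the fetch.
bin/nutch parse crawl/segments/20070101120000
```

Equivalently, setting the fetcher.parse property to false in conf/nutch-site.xml should make non-parsing mode the default without needing the command-line flag.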


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com