Hi Andrzej,

I am running the fetcher in non-parsing mode. I have this in nutch-site.xml:

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content.</description>
</property>

Maybe I didn't phrase my question correctly. I get a couple of fetcher threads failing with java.lang.OutOfMemoryError like this (from hadoop.log):

2007-09-09 01:07:24,150 INFO fetcher.Fetcher - fetching http://scholar.google.com/intl/en/scholar/libraries.html
2007-09-09 01:07:27,084 INFO fetcher.Fetcher - fetch of http://logging.apache.org/log4j/1.2/faq.html failed with: java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:27,085 INFO fetcher.Fetcher - fetching http://popular.ebay.com/ns/Tickets/Alabama+Tickets.html
2007-09-09 01:07:32,151 INFO fetcher.Fetcher - fetch of http://hockey.fantasysports.yahoo.com/hockey/register/createjoin failed with: java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:32,817 INFO fetcher.Fetcher - fetch of http://scholar.google.com/intl/en/scholar/libraries.html failed with: java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:32,817 FATAL fetcher.Fetcher - java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:33,380 FATAL fetcher.Fetcher - fetcher caught:java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:33,380 FATAL fetcher.Fetcher - java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:33,380 FATAL fetcher.Fetcher - fetcher caught:java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:37,865 INFO fetcher.Fetcher - fetch of http://cn.yahoo.com/allservice/index.html failed with: java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:38,019 FATAL fetcher.Fetcher - java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:38,020 FATAL fetcher.Fetcher - fetcher caught:java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:42,887 INFO fetcher.Fetcher - fetch of http://popular.ebay.com/ns/Tickets/Alabama+Tickets.html failed with: java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:43,045 FATAL fetcher.Fetcher - java.lang.OutOfMemoryError: Java heap space
2007-09-09 01:07:43,045 FATAL fetcher.Fetcher - fetcher caught:java.lang.OutOfMemoryError: Java heap space

Any ideas why?

Thanks,
Tomislav

On Mon, 2007-09-10 at 21:30 +0200, Andrzej Bialecki wrote:
> Tomislav Poljak wrote:
> > Hi,
> > so I have dedicated 1000 MB (-Xmx1000m) to the Nutch java process when
> > fetching (default settings). When using 10 threads I can fetch 25000
> > urls, but when using 20 threads the fetcher fails with:
> > java.lang.OutOfMemoryError: Java heap space even when fetching a 15000 url
> > fetchlist. Is 20 threads too much for -Xmx1000m or is something else
> > wrong? What would be the recommended settings (number of threads, how much
> > RAM is needed) for fetching a list of 100k urls (with best performance)?
>
> I routinely run crawls with 100 threads or more. If you're using the
> fetcher in parsing mode (i.e. it not only fetches but also parses the
> content) then your problem is likely related to the memory consumption
> of a parsing plugin (such as PDF or MS Office parsers).
>
> I suggest running the fetcher in non-parsing mode (-noParsing cmd-line
> option), and then parsing the segment in a separate step (bin/nutch parse).
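
To see how many fetches failed and with which error, the "fetch of ... failed with: ..." lines in hadoop.log can be tallied. The following is only a rough sketch: it assumes the log line format shown in this message, and the name count_fetch_failures is hypothetical, not part of Nutch.

```python
import re
from collections import Counter

# Matches fetcher log lines of the form:
#   <timestamp> INFO fetcher.Fetcher - fetch of <url> failed with: <error>
# (format assumed from the hadoop.log excerpt in this message)
FAILED = re.compile(r"fetch of (\S+) failed with: (.+)$")


def count_fetch_failures(lines):
    """Return a Counter mapping error message -> number of failed fetches."""
    errors = Counter()
    for line in lines:
        match = FAILED.search(line.strip())
        if match:
            errors[match.group(2)] += 1
    return errors


# Three lines taken from the log excerpt; only the second is a failed fetch.
sample = [
    "2007-09-09 01:07:24,150 INFO fetcher.Fetcher - fetching "
    "http://scholar.google.com/intl/en/scholar/libraries.html",
    "2007-09-09 01:07:27,084 INFO fetcher.Fetcher - fetch of "
    "http://logging.apache.org/log4j/1.2/faq.html failed with: "
    "java.lang.OutOfMemoryError: Java heap space",
    "2007-09-09 01:07:32,817 FATAL fetcher.Fetcher - "
    "java.lang.OutOfMemoryError: Java heap space",
]
failures = count_fetch_failures(sample)
```

On a real log, passing the open file object (for example open("hadoop.log")) instead of the sample list should work the same way, since the function only iterates over lines.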
