Hi,

On 9/11/07, Tomislav Poljak <[EMAIL PROTECTED]> wrote:
> Hi Andrzej,
> I am running fetcher in non-parsing mode, I have this in nutch-site.xml:
>
> <property>
> <name>fetcher.parse</name>
> <value>false</value>
> <description>If true, fetcher will parse content.</description>
> </property>
>
> Maybe I didn't post a question correctly. I get a couple of fetcher
> threads failing with java.lang.OutOfMemoryError like this (from
> hadoop.log):
>
> 2007-09-09 01:07:24,150 INFO fetcher.Fetcher - fetching http://scholar.google.com/intl/en/scholar/libraries.html
> 2007-09-09 01:07:27,084 INFO fetcher.Fetcher - fetch of http://logging.apache.org/log4j/1.2/faq.html failed with: java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:27,085 INFO fetcher.Fetcher - fetching http://popular.ebay.com/ns/Tickets/Alabama+Tickets.html
> 2007-09-09 01:07:32,151 INFO fetcher.Fetcher - fetch of http://hockey.fantasysports.yahoo.com/hockey/register/createjoin failed with: java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:32,817 INFO fetcher.Fetcher - fetch of http://scholar.google.com/intl/en/scholar/libraries.html failed with: java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:32,817 FATAL fetcher.Fetcher - java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:33,380 FATAL fetcher.Fetcher - fetcher caught:java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:33,380 FATAL fetcher.Fetcher - java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:33,380 FATAL fetcher.Fetcher - fetcher caught:java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:37,865 INFO fetcher.Fetcher - fetch of http://cn.yahoo.com/allservice/index.html failed with: java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:38,019 FATAL fetcher.Fetcher - java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:38,020 FATAL fetcher.Fetcher - fetcher caught:java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:42,887 INFO fetcher.Fetcher - fetch of http://popular.ebay.com/ns/Tickets/Alabama+Tickets.html failed with: java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:43,045 FATAL fetcher.Fetcher - java.lang.OutOfMemoryError: Java heap space
> 2007-09-09 01:07:43,045 FATAL fetcher.Fetcher - fetcher caught:java.lang.OutOfMemoryError: Java heap space
>
> Any ideas why?

Has your fetch been running for a long time? Nutch can leak some plugins
and classes in local mode, but that only becomes a problem when there are
too many maps, because each new map task loads new classes without, it
seems, unloading the older ones.

Related issue: NUTCH-356

>
> Thanks,
> Tomislav
>
>
> On Mon, 2007-09-10 at 21:30 +0200, Andrzej Bialecki wrote:
> > Tomislav Poljak wrote:
> > > Hi,
> > > so I have dedicated 1000 MB (-Xmx1000m) to the Nutch java process
> > > when fetching (default settings). When using 10 threads I can fetch
> > > 25000 urls, but when using 20 threads the fetcher fails with
> > > java.lang.OutOfMemoryError: Java heap space, even when fetching a
> > > 15000 url fetchlist. Is 20 threads too much for -Xmx1000m or is
> > > something else wrong? What would be the recommended settings (number
> > > of threads, how much RAM is needed) for fetching a list of 100k urls
> > > (with best performance)?
> >
> > I routinely run crawls with 100 threads or more. If you're using the
> > fetcher in parsing mode (i.e. it not only fetches but also parses the
> > content) then your problem is likely related to the memory consumption
> > of a parsing plugin (such as PDF or MS Office parsers).
> >
> > I suggest running the fetcher in non-parsing mode (-noParsing cmd-line
> > option), and then parsing the segment in a separate step (bin/nutch
> > parse).
> >

--
Doğacan Güney
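
For reference, the two-step workflow suggested in the quoted message (fetch without parsing, then parse the segment separately) could be sketched as a small shell function. The -noParsing option and bin/nutch parse come from the thread; the function name fetch_then_parse, the NUTCH variable, and the argument order (segment first, then -threads n) are assumptions for illustration.

```shell
# Path to the Nutch launcher script (assumption: run from the Nutch home
# directory; override NUTCH if the script lives elsewhere).
NUTCH="${NUTCH:-bin/nutch}"

# fetch_then_parse SEGMENT [THREADS]
# Hypothetical helper: fetches SEGMENT in non-parsing mode, then parses it
# as a separate step so parser memory use stays out of the fetch.
fetch_then_parse() {
  segment="$1"
  threads="${2:-10}"   # 10 threads unless a second argument is given
  "$NUTCH" fetch "$segment" -threads "$threads" -noParsing || return 1
  "$NUTCH" parse "$segment"
}
```

It would be called as `fetch_then_parse <segment-dir> 20`, where the segment directory is whatever the generate step produced.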
