Hi,

The reduce step of updatedb does indeed require quite a lot of memory. See https://issues.apache.org/jira/browse/NUTCH-702 for a discussion of this subject.

BTW, you'll have to specify the parameter mapred.child.java.opts in your conf/hadoop-site.xml so that the value is sent to the Hadoop slaves. Another way to do that is to specify it on the command line with:

  -D mapred.child.java.opts=-Xmx2000m
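For reference, here is a minimal sketch of what the hadoop-site.xml entry could look like; the 2000m heap is just an example value, size it to what your slave nodes can afford:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2000m</value>
  </property>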
Julien
--
DigitalPebble Ltd
http://www.digitalpebble.com

2009/8/16 MoD <w...@ant.com>

> Hi,
>
> During the CrawlDb MapReduce job, the reduce workers fail one by one with:
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>         at java.util.concurrent.ConcurrentHashMap$HashEntry.newArray(ConcurrentHashMap.java:205)
>         at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:291)
>         at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
>         at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
>         at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
>         at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
>         at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
>         at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
>         at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
>         at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>         at org.apache.hadoop.mapred.Child.main(Child.java:158)
>
> I have the default 1Gb per JVM:
>
> /opt/java/jre/bin/java -Xmx1000m
>
> Running out of memory in a Java process is somewhat surprising.
> Does this job need more than 1Gb of RAM per node?
>
> Oh, by the way, I don't have swap files; the system has 8Gb and doesn't
> seem to be missing any RAM.
>
> My command line:
>
> nu...@titaniumpelican search $ ./bin/nutch updatedb
> hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb -dir
> hdfs://titaniumpelican:9000/user/nutch/crawl/segments
> CrawlDb update: starting
> CrawlDb update: db: hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb
> CrawlDb update: segments:
> [hdfs://titaniumpelican:9000/user/nutch/crawl/segments/20090814122219]
> CrawlDb update: additions allowed: false
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: Merging segment data into db.
> java.lang.OutOfMemoryError: Java heap space
>
> Question: why does this job cut the work into 140 map tasks?
>
> Regards,
> Louis