Hi,

The reduce step of updatedb does indeed require quite a lot of memory. See
https://issues.apache.org/jira/browse/NUTCH-702 for a discussion of this
subject.
BTW you'll need to set the parameter mapred.child.java.opts in your
conf/hadoop-site.xml so that the value is sent to the Hadoop slaves. Another
way to do it is to specify it on the command line with:
-D mapred.child.java.opts=-Xmx2000m
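
For instance (the -Xmx value below is only an illustration; size it to your
own machines), the entry in conf/hadoop-site.xml would look something like:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2000m</value>
  </property>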

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/8/16 MoD <w...@ant.com>

> Hi,
>
> During the CrawlDb map-reduce job,
> the reduce workers fail one by one with:
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>        at
> java.util.concurrent.ConcurrentHashMap$HashEntry.newArray(ConcurrentHashMap.java:205)
>        at
> java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:291)
>        at
> java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
>        at
> java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
>        at
> org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
>        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
>        at
> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
>        at
> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>        at org.apache.hadoop.mapred.Child.main(Child.java:158)
>
>
> I have the default 1 GB per JVM:
>
> /opt/java/jre/bin/java -Xmx1000m
>
>
> Running out of memory for a Java process is somewhat surprising.
> Does this job need more than 1 GB of RAM per node?
>
> Oh, by the way, I don't have swap files; the system has 8 GB and doesn't
> seem to be running low on RAM.
>
> My command line:
>
> nu...@titaniumpelican search $ ./bin/nutch  updatedb
> hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb -dir
> hdfs://titaniumpelican:9000/user/nutch/crawl/segments
> CrawlDb update: starting
> CrawlDb update: db: hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb
> CrawlDb update: segments:
> [hdfs://titaniumpelican:9000/user/nutch/crawl/segments/20090814122219]
> CrawlDb update: additions allowed: false
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: Merging segment data into db.
> java.lang.OutOfMemoryError: Java heap space
>
>
> Question: why does this job split the work into 140 map tasks?
>
> Regards,
> Louis
>



