Hi,
I am using the trunk version of Nutch on a cluster of 5 EC2 nodes to crawl
the Internet. Each node has 7GB of memory, and I have
configured mapred.child.java.opts to -Xmx3000m in hadoop-site.xml. When I
tried to update a crawldb of about 20M URLs with a crawl segment containing
about 5M fetched pages, I got the following error:

java.lang.OutOfMemoryError: Java heap space
        at java.util.concurrent.locks.ReentrantLock.<init>(Unknown Source)
        at java.util.concurrent.ConcurrentHashMap$Segment.<init>(Unknown Source)
        at java.util.concurrent.ConcurrentHashMap.<init>(Unknown Source)
        at java.util.concurrent.ConcurrentHashMap.<init>(Unknown Source)
        at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:46)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:311)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:1)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
        at org.apache.hadoop.mapred.Child.main(Child.java:155)


java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.concurrent.locks.ReentrantLock.<init>(Unknown Source)
        at java.util.concurrent.ConcurrentHashMap$Segment.<init>(Unknown Source)
        at java.util.concurrent.ConcurrentHashMap.<init>(Unknown Source)
        at java.util.concurrent.ConcurrentHashMap.<init>(Unknown Source)
        at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:46)
        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
        at org.apache.nutch.crawl.CrawlDatum.<init>(CrawlDatum.java:135)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:95)
        at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:1)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:430)
        at org.apache.hadoop.mapred.Child.main(Child.java:155)


Does anyone have an idea about this problem? I assumed that the output of the
reduce function is written to the filesystem immediately instead of being held
in memory longer than necessary; otherwise the system would not be able to
scale. I think a 3GB heap is the maximum I can use, because there is no swap
space on EC2 and each node can run a maximum of 2 map/reduce tasks.
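For reference, the relevant part of my hadoop-site.xml looks roughly like the
sketch below. The mapred.child.java.opts value is the one mentioned above; the
two *.tasks.maximum properties are shown only to illustrate the
2-tasks-per-node limit and may differ from what your cluster uses.

  <!-- heap size for each child map/reduce task JVM -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx3000m</value>
  </property>
  <!-- illustrative: at most 2 concurrent map and 2 reduce tasks per node -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>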

Thank you very much.

Regards

Edwin
