That seems to work; thanks for that.

Hi there, I ended up using topN during my generate phase, which I didn't want to do, although it does seem to fix my problem.
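In case it helps anyone else, the generate call I used looks roughly like this (the crawl/crawldb path and the cap of 10000 are just from my setup; adjust to yours):

  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments -topN 10000

-topN limits each new segment to the N top-scoring URLs, which keeps the individual segments small enough for mergesegs to cope with.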
What I have also been observing is that the reduce step seems to take very long on large segments. Could anyone shed some light on the ratio of segment size to processing time on a standard machine, say a 2GB RAM, 4-core server? The segments in my test environment are now about 3MB each; at one point I had 600MB segments, and mergesegs seemed to take forever and eventually stopped responding. The last I heard from the process was something like:

2009-10-11 14:29:56,752 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce

over and over, then finally silence... any insight?

> I guess your segments are too big...
> Try to merge just a few of them in one shot. If you have N segments, start by
> merging just N-1; if you still get the error, try N-2... until you find the
> largest number of segments you can merge in one shot.
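> Something like this, for example, assuming your segments live under
> crawl/segments (the batch size of 3 is just a starting point):
>
>   $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments \
>     $(ls -d crawl/segments/* | head -n 3)
>
> then move the merged output aside and repeat with the next batch.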
>
> thx

>> Subject: OutOfMemoryError: Java heap space
>> From: fa...@butterflycluster.net
>> To: nutch-user@lucene.apache.org
>> Date: Sun, 11 Oct 2009 15:26:14 +1100
>>
>> Hi all,
>>
>> I am getting the JVM error below during a recrawl, specifically during
>> the execution of
>>
>>   $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
>>
>> I am running on a single machine:
>>   Linux 2.6.24-23-xen x86_64
>>   4G RAM
>>   java-6-sun
>>   nutch-1.0
>>   JAVA_HEAP_MAX=-Xmx1000m
>>
>> Any suggestions? I am about to up my heap max to -Xmx2000m.
>>
>> I haven't encountered this before when running with the above specs, so I
>> am not sure what could have changed. Any suggestions will be greatly
>> appreciated.
>>
>> Thanks.
>>
>> > 2009-10-11 14:29:56,752 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
>> > 2009-10-11 14:30:15,801 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
>> > 2009-10-11 14:31:19,197 INFO [org.apache.hadoop.mapred.TaskRunner] - Communication exception: java.lang.OutOfMemoryError: Java heap space
>> >   at java.util.ResourceBundle$Control.getCandidateLocales(ResourceBundle.java:2220)
>> >   at java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1229)
>> >   at java.util.ResourceBundle.getBundle(ResourceBundle.java:715)
>> >   at org.apache.hadoop.mapred.Counters$Group.getResourceBundle(Counters.java:218)
>> >   at org.apache.hadoop.mapred.Counters$Group.<init>(Counters.java:202)
>> >   at org.apache.hadoop.mapred.Counters.getGroup(Counters.java:410)
>> >   at org.apache.hadoop.mapred.Counters.incrAllCounters(Counters.java:491)
>> >   at org.apache.hadoop.mapred.Counters.sum(Counters.java:506)
>> >   at org.apache.hadoop.mapred.LocalJobRunner$Job.statusUpdate(LocalJobRunner.java:222)
>> >   at org.apache.hadoop.mapred.Task$1.run(Task.java:418)
>> >   at java.lang.Thread.run(Thread.java:619)
>> >
>> > 2009-10-11 14:31:22,197 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
>> > 2009-10-11 14:31:25,197 INFO [org.apache.hadoop.mapred.LocalJobRunner] - reduce > reduce
>> > 2009-10-11 14:31:40,002 WARN [org.apache.hadoop.mapred.LocalJobRunner] - job_local_0001
>> > java.lang.OutOfMemoryError: Java heap space
>> >   at java.util.concurrent.locks.ReentrantLock.<init>(ReentrantLock.java:234)
>> >   at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:289)
>> >   at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
>> >   at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
>> >   at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
>> >   at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
>> >   at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:260)
>> >   at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
>> >   at org.apache.nutch.metadata.MetaWrapper.readFields(MetaWrapper.java:101)
>> >   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>> >   at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>> >   at org.apache.hadoop.mapred.Task$ValuesIterator.readNextValue(Task.java:940)
>> >   at org.apache.hadoop.mapred.Task$ValuesIterator.next(Task.java:880)
>> >   at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.moveToNext(ReduceTask.java:237)
>> >   at org.apache.hadoop.mapred.ReduceTask$ReduceValuesIterator.next(ReduceTask.java:233)
>> >   at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:377)
>> >   at org.apache.nutch.segment.SegmentMerger.reduce(SegmentMerger.java:113)
>> >   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>> >   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:170)
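>>
>> PS: rather than editing bin/nutch, I am going to try raising the heap via
>> the NUTCH_HEAPSIZE environment variable, which (if I am reading the
>> bin/nutch script right) overrides the default JAVA_HEAP_MAX; the value is
>> in megabytes:
>>
>>   export NUTCH_HEAPSIZE=2000
>>   $NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*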