There are two solutions:

1. Write a lightweight version of the segment merger, which should not be hard if you are familiar with Hadoop.

2. Don't merge segments. If you have a reasonable number of segments, even in the hundreds, Nutch can still handle this.
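To illustrate what I mean by a lightweight merger in option 1: for finished segments the per-part merge can be as simple as concatenation. The fragment below is only a sketch (the class and method names are mine); it glues together the crawl_parse part, which is a plain SequenceFile, while the MapFile-backed parts (content, crawl_fetch, parse_data, parse_text) would still need their keys re-sorted, so don't read it as a drop-in replacement for SegmentMerger.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class ConcatCrawlParse {

  /** Appends the crawl_parse entries of every input segment into one file. */
  public static void concat(Configuration conf, Path[] segments, Path out)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = null;
    try {
      for (Path segment : segments) {
        Path partDir = new Path(segment, "crawl_parse");
        for (FileStatus status : fs.listStatus(partDir)) {
          if (status.isDir()) continue;              // only the part-NNNNN files
          SequenceFile.Reader reader =
              new SequenceFile.Reader(fs, status.getPath(), conf);
          try {
            Writable key = (Writable)
                ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable)
                ReflectionUtils.newInstance(reader.getValueClass(), conf);
            if (writer == null) {                    // open the output lazily
              writer = SequenceFile.createWriter(fs, conf, out,
                  reader.getKeyClass(), reader.getValueClass());
            }
            while (reader.next(key, value)) {
              writer.append(key, value);             // plain concatenation, no join
            }
          } finally {
            reader.close();
          }
        }
      }
    } finally {
      if (writer != null) writer.close();
    }
  }
}

This turns the work into sequential reads and writes of only the data you keep, instead of the general sort/join over all parts that the current SegmentMerger does.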
Regards,

Arkadi

> -----Original Message-----
> From: Yves Petinot [mailto:y...@snooth.com]
> Sent: Saturday, 27 March 2010 2:21 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: Running out of disk space during segment merger
>
> Thanks a lot for your reply, Arkadi. It's good to know that this is
> indeed a known problem. With the current version, is there anything more
> one can do other than throwing a huge amount of temp disk space at each
> node? Based on my experience and given a replication factor of N, would
> a rule of thumb of reserving roughly N TB per set of 1M URLs make sense?
>
> -y
>
> arkadi.kosmy...@csiro.au wrote:
> > Hi Yves,
> >
> > Yes, what you got is a "normal" result. This issue is discussed every
> > few months on this list. To my mind, the segment merger is too general.
> > It assumes that the segments are at arbitrary stages of completion and
> > works on this assumption. But this is not a common case at all.
> > Mostly, people just want to merge finished segments. The algorithm
> > could be much cheaper in this case.
> >
> > Regards,
> >
> > Arkadi
> >
> > -----Original Message-----
> > From: Yves Petinot [mailto:y...@snooth.com]
> > Sent: Friday, 26 March 2010 6:01 AM
> > To: nutch-user@lucene.apache.org
> > Subject: Running out of disk space during segment merger
> >
> > Hi,
> >
> > I was wondering if some people on the list have been facing issues with
> > the segment merge phase, with all the nodes on their Hadoop cluster
> > eventually running out of disk space? The type of error I'm getting
> > looks like this:
> >
> > java.io.IOException: Task: attempt_201003241258_0005_r_000001_0 - The reduce copier failed
> >         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:375)
> >         at org.apache.hadoop.mapred.Child.main(Child.java:158)
> > Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for file:/home/snoothbot/nutch/hadoop_tmp/mapred/local/taskTracker/jobcache/job_201003241258_0005/attempt_201003241258_0005_r_000001_0/output/map_115.out
> >         at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
> >         at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> >         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2384)
> >
> > FSError: java.io.IOException: No space left on device
> >
> > To give a little background, I'm currently running this on a 3-node
> > cluster, each node having 500GB drives, which are mostly empty at the
> > beginning of the process (~400 GB available on each node). The
> > replication factor is set to 2 and I also enabled Hadoop block
> > compression. Now, the Nutch crawl takes up around 20 GB of disk (with 7
> > segments to merge, one of them being 9 GB, the others ranging from 1 to
> > 3 GB in size), so intuitively there should be plenty of space available
> > for the merge operation, but we still end up running out of space during
> > the reduce phase (7 reduce tasks). I'm currently trying to increase the
> > number of reduce tasks to limit the resource/disk consumption of any
> > given task, but I'm wondering if someone has experienced this type of
> > issue before and whether there is a better way of approaching it? For
> > instance, would using the multiple output segments option be useful in
> > decreasing the amount of temp disk space needed at any given time?
> >
> > Many thanks in advance,
> >
> > -yp
> >
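P.S. On the temp-space and reduce-task questions in your first message: the knobs that matter here are Hadoop's rather than Nutch's. What ran out was the reduce copier's local scratch space, not HDFS, so pointing mapred.local.dir in each node's hadoop-site.xml at your biggest volumes is the first thing to check. Per job, something like the sketch below is what I would try; the property names are the standard Hadoop 0.20-era ones and the values are only placeholders for your cluster. And if I remember correctly, the segment merger also takes a -slice option that caps the number of URLs per output segment, which bounds how much any single output has to hold.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class MergeJobTuning {

  /** Per-job settings to reduce the local disk footprint of the merge. */
  public static Configuration tunedConf() {
    Configuration conf = NutchConfiguration.create();
    // More reducers -> each one spills less intermediate data to its local disk.
    conf.setInt("mapred.reduce.tasks", 21);
    // Compressed map output shrinks what the reduce copiers have to pull and spill.
    conf.setBoolean("mapred.compress.map.output", true);
    return conf;
  }
}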