There are two solutions:

1. Write a lightweight version of the segment merger, which should not be hard if
you are familiar with Hadoop (a rough sketch follows this list).
2. Don't merge segments. If you have a reasonable number of segments, even in the
hundreds, Nutch can still handle this.
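
For option 1, here is a rough, untested sketch of the idea for a single segment
sub-directory (crawl_fetch), assuming every input segment is completely fetched: an
identity map over the finished segments' data and a reduce that keeps only the
newest CrawlDatum per URL. The class name LightweightMerger is made up, it uses the
old mapred API, and a real replacement would also have to handle content,
parse_data, parse_text and crawl_parse; it is only meant to show why the job can be
much simpler than the general SegmentMerger.

// Hypothetical, untested sketch: merges only the crawl_fetch part of a set of
// *finished* segments by keeping the newest CrawlDatum per URL.
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.mapred.*;
import org.apache.nutch.crawl.CrawlDatum;

public class LightweightMerger extends MapReduceBase
    implements Reducer<Text, CrawlDatum, Text, CrawlDatum> {

  private JobConf conf;

  public void configure(JobConf conf) { this.conf = conf; }

  // Keep only the most recently fetched record for each URL.
  public void reduce(Text url, Iterator<CrawlDatum> values,
                     OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    CrawlDatum newest = null;
    while (values.hasNext()) {
      CrawlDatum d = values.next();
      if (newest == null || d.getFetchTime() > newest.getFetchTime()) {
        newest = WritableUtils.clone(d, conf);   // value objects are reused, so copy
      }
    }
    if (newest != null) {
      output.collect(url, newest);
    }
  }

  // Usage: LightweightMerger <output_segment> <segment1> <segment2> ...
  public static void main(String[] args) throws IOException {
    JobConf job = new JobConf(LightweightMerger.class);
    job.setJobName("lightweight merge (crawl_fetch only)");
    for (int i = 1; i < args.length; i++) {
      FileInputFormat.addInputPath(job, new Path(args[i], CrawlDatum.FETCH_DIR_NAME));
    }
    FileOutputFormat.setOutputPath(job, new Path(args[0], CrawlDatum.FETCH_DIR_NAME));
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(MapFileOutputFormat.class);  // crawl_fetch is stored as a MapFile
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    // the default IdentityMapper passes <url, CrawlDatum> straight through
    job.setReducerClass(LightweightMerger.class);
    JobClient.runJob(job);
  }
}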

Regards,

Arkadi

> -----Original Message-----
> From: Yves Petinot [mailto:y...@snooth.com]
> Sent: Saturday, 27 March 2010 2:21 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: Running out of disk space during segment merger
> 
> Thanks a lot for your reply, Arkadi. It's good to know that this is
> indeed a known problem. With the current version, is there anything more
> one can do other than throwing a huge amount of temp disk space at each
> node? Based on my experience, and given a replication factor of N, would
> a rule of thumb of reserving roughly N TB per 1M URLs make sense?
> 
> -y
> 
> arkadi.kosmy...@csiro.au wrote:
> > Hi Yves,
> >
> > Yes, what you got is a "normal" result. This issue is discussed every
> > few months on this list. To my mind, the segment merger is too general.
> > It assumes that the segments are at arbitrary stages of completion and
> > works on this assumption. But this is not a common case at all.
> > Mostly, people just want to merge finished segments. The algorithm
> > could be much cheaper in this case.
> >
> > Regards,
> >
> > Arkadi
> >
> > -----Original Message-----
> > From: Yves Petinot [mailto:y...@snooth.com]
> > Sent: Friday, 26 March 2010 6:01 AM
> > To: nutch-user@lucene.apache.org
> > Subject: Running out of disk space during segment merger
> >
> > Hi,
> >
> > I was wondering if some people on the list have been facing issues with
> > the segment merge phase, with all the nodes on their Hadoop cluster
> > eventually running out of disk space? The type of errors I'm getting
> > looks like this:
> >
> > java.io.IOException: Task: attempt_201003241258_0005_r_000001_0 - The reduce copier failed
> >     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:375)
> >     at org.apache.hadoop.mapred.Child.main(Child.java:158)
> > Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for file:/home/snoothbot/nutch/hadoop_tmp/mapred/local/taskTracker/jobcache/job_201003241258_0005/attempt_201003241258_0005_r_000001_0/output/map_115.out
> >     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
> >     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> >     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2384)
> >
> > FSError: java.io.IOException: No space left on device
> >
> > To give a little background, I'm currently running this on a 3-node
> > cluster, each node having 500GB drives, which are mostly empty at the
> > beginning of the process (~400 GB available on each node). The
> > replication factor is set to 2 and I also enabled Hadoop block
> > compression. Now, the Nutch crawl takes up around 20 GB of disk (with 7
> > segments to merge, one of them being 9 GB, the others ranging from 1 to
> > 3 GB in size), so intuitively there should be plenty of space available
> > for the merge operation, but we still end up running out of space during
> > the reduce phase (7 reduce tasks). I'm currently trying to increase the
> > number of reduce tasks to limit the resource/disk consumption of any
> > given task, but I'm wondering if someone has experienced this type of
> > issue before and whether there is a better way of approaching it? For
> > instance, would using the multiple output segments option be useful in
> > decreasing the amount of temp disk space needed at any given time?
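
On the temp disk question, two generic Hadoop knobs that usually shrink what the
reduce copier spills to mapred.local.dir are compressing the map output and
spreading the work over more reduce tasks. A minimal, hypothetical sketch follows;
the class name MergeJobTuning and the reduce count are made up, the properties and
methods are standard Hadoop of that era, and whether this is enough for a given
merge depends on the workload.

// Hypothetical tuning sketch: generic Hadoop settings, nothing specific to the
// Nutch SegmentMerger; the same properties can also be set in hadoop-site.xml.
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.mapred.JobConf;

public class MergeJobTuning {
  public static void tune(JobConf job) {
    job.setNumReduceTasks(14);           // more, smaller reduce tasks (arbitrary number)
    job.setCompressMapOutput(true);      // compress the map -> reduce intermediate data
    job.setBoolean("mapred.output.compress", true);           // compress job output
    job.set("mapred.output.compression.type",
            SequenceFile.CompressionType.BLOCK.toString());   // block compression
    // also make sure mapred.local.dir points at partitions with enough free space
  }
}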
> >
> > many thanks in advance,
> >
> > -yp
> >
> >
