Hi Yves,

I'm glad it helped. I wish you success.

Regards,

Arkadi

> -----Original Message-----
> From: Yves Petinot [mailto:y...@snooth.com]
> Sent: Saturday, 10 April 2010 12:56 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: Running out of disk space during segment merger
> 
> Arkadi,
> 
> thanks a lot for these suggestions. Indeed, skipping the merge step
> seems perfectly acceptable as I have a fairly reasonable number of
> segments. As an experiment I also attempted to perform my crawl with a
> single segment (which does not even raise the issue of merging), but my
> cluster doesn't seem to be able to handle the amount of disk space
> required during the fetch's reduce tasks. Somewhat frustrating, as I
> only have about 1.2M URLs in this crawl.
> 
> at any rate thanks again for your comments !
> 
> -yp
> 
> arkadi.kosmy...@csiro.au wrote:
> > There are two solutions:
> >
> > 1. Write a lightweight version of the segment merger, which should not
> > be hard if you are familiar with Hadoop.
> > 2. Don't merge segments. If you have a reasonable number of segments,
> > even in the hundreds, Nutch can still handle this (see the sketch below).
> >
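> > For the second option, a rough sketch of what skipping the merge looks
> > like, assuming Nutch 1.0-era command syntax (double-check against the
> > bin/nutch usage messages on your version): the later steps simply take
> > all of the finished segments at once.
> >
> >   # build the linkdb and the index straight from the individual segments,
> >   # with no mergesegs step in between
> >   bin/nutch invertlinks crawl/linkdb -dir crawl/segments
> >   bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
> >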
> > Regards,
> >
> > Arkadi
> >
> >
> >> -----Original Message-----
> >> From: Yves Petinot [mailto:y...@snooth.com]
> >> Sent: Saturday, 27 March 2010 2:21 AM
> >> To: nutch-user@lucene.apache.org
> >> Subject: Re: Running out of disk space during segment merger
> >>
> >> Thanks a lot for your reply, Arkadi. It's good to know that this is
> >> indeed a known problem. With the current version, is there anything
> >> more one can do other than throwing a huge amount of temp disk space
> >> at each node? Based on my experience, and given a replication factor
> >> of N, would a rule of thumb of reserving roughly N TB per 1M URLs
> >> make sense?
> >>
> >> -y
> >>
> >> arkadi.kosmy...@csiro.au wrote:
> >>
> >>> Hi Yves,
> >>>
> >>> Yes, what you got is a "normal" result. This issue is discussed every
> >>> few months on this list. To my mind, the segment merger is too
> >>> general. It assumes that the segments are at arbitrary stages of
> >>> completion and works on this assumption. But this is not a common
> >>> case at all. Mostly, people just want to merge finished segments. The
> >>> algorithm could be much cheaper in this case.
> >>
> >>> Regards,
> >>>
> >>> Arkadi
> >>>
> >>> -----Original Message-----
> >>> From: Yves Petinot [mailto:y...@snooth.com]
> >>> Sent: Friday, 26 March 2010 6:01 AM
> >>> To: nutch-user@lucene.apache.org
> >>> Subject: Running out of disk space during segment merger
> >>>
> >>> Hi,
> >>>
> >>> I was wondering if some people on the list have been facing issues
> >>> with the segment merge phase, with all the nodes on their Hadoop
> >>> cluster eventually running out of disk space? The type of errors I'm
> >>> getting looks like this:
> >>>
> >>> java.io.IOException: Task: attempt_201003241258_0005_r_000001_0 - The reduce copier failed
> >>>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:375)
> >>>   at org.apache.hadoop.mapred.Child.main(Child.java:158)
> >>> Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for file:/home/snoothbot/nutch/hadoop_tmp/mapred/local/taskTracker/jobcache/job_201003241258_0005/attempt_201003241258_0005_r_000001_0/output/map_115.out
> >>>   at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
> >>>   at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> >>>   at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2384)
> >>>
> >>> FSError: java.io.IOException: No space left on device
> >>>
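> >>> As far as I understand, the DiskChecker error above is about the
> >>> tasktrackers' local scratch space (mapred.local.dir, which defaults to
> >>> a directory under hadoop.tmp.dir on the plain local filesystem), so
> >>> the HDFS replication factor does not apply to that data. A quick way
> >>> to see where that space is going, assuming 0.19-era property and file
> >>> names, adjust for your release:
> >>>
> >>>   # which local directories do intermediate map outputs / reduce copies land in?
> >>>   grep -A 1 -E 'hadoop.tmp.dir|mapred.local.dir' conf/hadoop-site.xml
> >>>   # and how full is the volume they live on?
> >>>   df -h /home/snoothbot/nutch/hadoop_tmp
> >>>
> >>> Listing several directories on different physical disks in
> >>> mapred.local.dir, if the nodes have them, also spreads that load.
> >>>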
> >>> To give a little background, I'm currently running this on a 3-node
> >>> cluster, each node having 500 GB drives, which are mostly empty at
> >>> the beginning of the process (~400 GB available on each node). The
> >>> replication factor is set to 2 and I also enabled Hadoop block
> >>> compression. Now, the Nutch crawl takes up around 20 GB of disk (with
> >>> 7 segments to merge, one of them being 9 GB, the others ranging from
> >>> 1 to 3 GB in size), so intuitively there should be plenty of space
> >>> available for the merge operation, but we still end up running out of
> >>> space during the reduce phase (7 reduce tasks). I'm currently trying
> >>> to increase the number of reduce tasks to limit the resource/disk
> >>> consumption of any given task, but I'm wondering if someone has
> >>> experienced this type of issue before and whether there is a better
> >>> way of approaching it? For instance, would using the multiple output
> >>> segments option be useful in decreasing the amount of temp disk space
> >>> needed at any given time?
> >>>
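> >>> To make that last question concrete, the option I have in mind is the
> >>> -slice flag of mergesegs which, if I read its usage right, splits the
> >>> merged output into several segments of a fixed number of URLs rather
> >>> than one big one (the slice size below is an arbitrary example):
> >>>
> >>>   # merge, but emit multiple smaller output segments
> >>>   bin/nutch mergesegs crawl/segments_merged -dir crawl/segments -slice 50000
> >>>
> >>> The number of reduce tasks itself comes from mapred.reduce.tasks in
> >>> the Hadoop configuration.
> >>>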
> >>> many thanks in advance,
> >>>
> >>> -yp
> >>>
> >>>
> >>>
> >
> >
> 
> 
> --
> Yves Petinot
> Senior Software Engineer
> www.snooth.com
> y...@snooth.com
> 
