Hi Yves, I am glad it helped. Wish you success.
Regards,

Arkadi

> -----Original Message-----
> From: Yves Petinot [mailto:y...@snooth.com]
> Sent: Saturday, 10 April 2010 12:56 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: Running out of disk space during segment merger
>
> Arkadi,
>
> Thanks a lot for these suggestions. Indeed, skipping the merge step
> seems to be perfectly acceptable, as I have a fairly reasonable number
> of segments. As an experiment I also attempted to perform my crawl
> with a single segment (which does not even raise the issue of
> merging), but my cluster doesn't seem to be able to handle the amount
> of disk space required during the fetch's reduce tasks. Somewhat
> frustrating, as I only have about 1.2M URLs in this crawl.
>
> At any rate, thanks again for your comments!
>
> -yp
>
> arkadi.kosmy...@csiro.au wrote:
> > There are two solutions:
> >
> > 1. Write a lightweight version of the segment merger, which should
> > not be hard if you are familiar with Hadoop.
> > 2. Don't merge segments. If you have a reasonable number of
> > segments, even in the 100s, Nutch can still handle this.
> >
> > Regards,
> >
> > Arkadi
> >
> >> -----Original Message-----
> >> From: Yves Petinot [mailto:y...@snooth.com]
> >> Sent: Saturday, 27 March 2010 2:21 AM
> >> To: nutch-user@lucene.apache.org
> >> Subject: Re: Running out of disk space during segment merger
> >>
> >> Thanks a lot for your reply, Arkadi. It's good to know that this is
> >> indeed a known problem. With the current version, is there anything
> >> more one can do other than throwing a huge amount of temp disk
> >> space at each node? Based on my experience, and given a replication
> >> factor of N, would a rule of thumb of reserving roughly N TB per
> >> set of 1M URLs make sense?
> >>
> >> -y
> >>
> >> arkadi.kosmy...@csiro.au wrote:
> >>> Hi Yves,
> >>>
> >>> Yes, what you got is a "normal" result. This issue is discussed
> >>> every few months on this list. To my mind, the segment merger is
> >>> too general: it assumes that the segments are at arbitrary stages
> >>> of completion and works on this assumption. But this is not a
> >>> common case at all. Mostly, people just want to merge finished
> >>> segments. The algorithm could be much cheaper in this case.
> >>>
> >>> Regards,
> >>>
> >>> Arkadi
> >>>
> >>> -----Original Message-----
> >>> From: Yves Petinot [mailto:y...@snooth.com]
> >>> Sent: Friday, 26 March 2010 6:01 AM
> >>> To: nutch-user@lucene.apache.org
> >>> Subject: Running out of disk space during segment merger
> >>>
> >>> Hi,
> >>>
> >>> I was wondering whether some people on the list have been facing
> >>> issues with the segment merge phase, with all the nodes on their
> >>> Hadoop cluster eventually running out of disk space?
> >>> The errors I'm getting look like this:
> >>>
> >>> java.io.IOException: Task: attempt_201003241258_0005_r_000001_0 - The reduce copier failed
> >>>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:375)
> >>>     at org.apache.hadoop.mapred.Child.main(Child.java:158)
> >>> Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for file:/home/snoothbot/nutch/hadoop_tmp/mapred/local/taskTracker/jobcache/job_201003241258_0005/attempt_201003241258_0005_r_000001_0/output/map_115.out
> >>>     at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
> >>>     at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
> >>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2384)
> >>> FSError: java.io.IOException: No space left on device
> >>>
> >>> To give a little background: I'm currently running this on a
> >>> 3-node cluster, each node having 500 GB drives, which are mostly
> >>> empty at the beginning of the process (~400 GB available on each
> >>> node). The replication factor is set to 2 and I also enabled
> >>> Hadoop block compression. Now, the Nutch crawl takes up around
> >>> 20 GB of disk (with 7 segments to merge, one of them being 9 GB,
> >>> the others ranging from 1 to 3 GB in size), so intuitively there
> >>> should be plenty of space available for the merge operation, but
> >>> we still end up running out of space during the reduce phase
> >>> (7 reduce tasks). I'm currently trying to increase the number of
> >>> reduce tasks to limit the resource/disk consumption of any given
> >>> task, but I'm wondering whether someone has experienced this type
> >>> of issue before and whether there is a better way of approaching
> >>> it. For instance, would using the multiple output segments option
> >>> be useful in decreasing the amount of temp disk space needed at
> >>> any given time?
> >>>
> >>> Many thanks in advance,
> >>>
> >>> -yp
>
> --
> Yves Petinot
> Senior Software Engineer
> www.snooth.com
> y...@snooth.com
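
As an illustration of the "don't merge" approach discussed in the
thread above, here is a minimal sketch of a Nutch 1.x crawl cycle that
indexes every fetched segment directly, so the heavy mergesegs job
never runs. The paths, the -topN value, and the number of fetch rounds
are hypothetical; the command names assume the stock bin/nutch script:

  # Hypothetical layout: seed URLs in urls/, crawl data under crawl/
  bin/nutch inject crawl/crawldb urls
  for round in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 100000
    segment=`ls -d crawl/segments/* | tail -1`   # newest segment
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment
  done
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  # Index all segments as they are -- the "bin/nutch mergesegs" step
  # is simply skipped:
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

Each segment is left in place and the indexer walks all of them, which
is what makes even hundreds of unmerged segments workable.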
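For reference, a sketch of the kind of Hadoop 0.20-era settings Yves
mentions (more reduce tasks, plus block compression of map and job
output) as they would appear in conf/hadoop-site.xml. The values are
illustrative guesses for a 3-node cluster, not tested recommendations:

  <property>
    <name>mapred.reduce.tasks</name>
    <!-- More reducers: each task spills less intermediate data to its
         local disks during the sort/merge phase (guessed value) -->
    <value>21</value>
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <!-- Compress map output, shrinking the per-reducer temp files that
         exhausted the local drives in the trace above -->
    <value>true</value>
  </property>
  <property>
    <name>mapred.output.compression.type</name>
    <!-- BLOCK compression for SequenceFile outputs such as segments;
         takes effect when mapred.output.compress is true -->
    <value>BLOCK</value>
  </property>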