Hi Yves,

Yes, what you got is a "normal" result. This issue is discussed every few 
months on this list. To my mind, the segment merger is too general. It assumes 
that the segments are at arbitrary stages of completion and works on this 
assumption. But this is not a common case at all. Mostly, people just want to 
merge finished segments. The algorithm could be much cheaper in that case.
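To illustrate the point in plain Python (this is not Nutch code; the segment names and records are made up): when every input segment is already finished, i.e. its records are complete and sorted by key, merging collapses to a single streaming k-way pass, with no need to reconcile records at arbitrary stages of completion.

```python
import heapq

# Illustrative sketch, NOT Nutch code: two "finished" segments whose
# (url, data) records are already complete and sorted by key. Merging
# them is a streaming k-way merge -- O(n log k) time, constant extra
# memory -- instead of a full sort/shuffle over every record.
seg_a = [("http://a.example/", "parse_a"), ("http://c.example/", "parse_c")]
seg_b = [("http://b.example/", "parse_b"), ("http://d.example/", "parse_d")]

merged = list(heapq.merge(seg_a, seg_b))  # streams; no big sort buffer
print([url for url, _ in merged])
```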

Regards,

Arkadi

-----Original Message-----
From: Yves Petinot [mailto:y...@snooth.com] 
Sent: Friday, 26 March 2010 6:01 AM
To: nutch-user@lucene.apache.org
Subject: Running out of disk space during segment merger

Hi,

I was wondering if some people on the list have been facing issues with 
the segment merge phase, with all the nodes on their Hadoop cluster 
eventually running out of disk space? The errors I'm getting look 
like this:

java.io.IOException: Task: attempt_201003241258_0005_r_000001_0 - The reduce 
copier failed
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:375)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not 
find any valid local directory for 
file:/home/snoothbot/nutch/hadoop_tmp/mapred/local/taskTracker/jobcache/job_201003241258_0005/attempt_201003241258_0005_r_000001_0/output/map_115.out
        at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
        at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2384)

FSError: java.io.IOException: No space left on device

To give a little background, I'm currently running this on a 3-node 
cluster, each node having 500 GB drives, which are mostly empty at the 
beginning of the process (~400 GB available on each node). The 
replication factor is set to 2 and I also enabled Hadoop block 
compression. Now, the Nutch crawl takes up around 20 GB of disk (with 7 
segments to merge, one of them being 9 GB, the others ranging from 1 to 
3 GB in size), so intuitively there should be plenty of space available 
for the merge operation, but we still end up running out of space during 
the reduce phase (7 reduce tasks). I'm currently trying to increase the 
number of reduce tasks to limit the resource/disk consumption of any 
given task, but I'm wondering if someone has experienced this type of 
issue before and whether there is a better way of approaching it. For 
instance, would using the multiple output segments option be useful in 
decreasing the amount of temp disk space needed at any given time?
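As a sketch of the "multiple output segments" idea: SegmentMerger's -slice option (where supported by the Nutch release in use) writes the merged output as several segments of a fixed number of URLs each, which can cap the amount of output any one pass must hold. The paths and slice size below are placeholders, not values from this setup:

```shell
# Merge all segments under crawl/segments, slicing the output into
# multiple segments of ~50000 URLs each (placeholder values; check
# "bin/nutch mergesegs" usage output for the flags your release supports).
bin/nutch mergesegs crawl/segments_merged -dir crawl/segments -slice 50000
```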

Many thanks in advance,

-yp