Hi,
I was wondering if anyone on the list has been facing issues with
the segment merge phase, with all the nodes on their Hadoop cluster
eventually running out of disk space? The error I'm getting looks
like this:
java.io.IOException: Task: attempt_201003241258_0005_r_000001_0 - The reduce copier failed
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:375)
        at org.apache.hadoop.mapred.Child.main(Child.java:158)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for file:/home/snoothbot/nutch/hadoop_tmp/mapred/local/taskTracker/jobcache/job_201003241258_0005/attempt_201003241258_0005_r_000001_0/output/map_115.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:335)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$LocalFSMerger.run(ReduceTask.java:2384)
FSError: java.io.IOException: No space left on device
To give a little background, I'm currently running this on a 3-node
cluster, each node having a 500GB drive, which is mostly empty at the
beginning of the process (~400 GB available on each node). The
replication factor is set to 2 and I also enabled Hadoop block
compression. Now, the Nutch crawl takes up around 20 GB of disk (with 7
segments to merge, one of them 9 GB, the others ranging from 1 to
3 GB in size), so intuitively there should be plenty of space available
for the merge operation, but we still end up running out of space during
the reduce phase (7 reduce tasks). I'm currently trying to increase the
number of reduce tasks to limit the disk consumption of any
single task, but I'm wondering if anyone has experienced this kind of
issue before and whether there is a better way of approaching it. For
instance, would using the multiple output segments option help in
decreasing the amount of temporary disk space needed at any given time?
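For what it's worth, here is roughly what I'm running and what I was planning to try. The paths are from my setup, and I'm going from what I understand of the mergesegs usage and the mapred.reduce.tasks property, so please correct me if I've misread the docs:

```shell
# Current invocation: merge all 7 segments into a single output segment.
bin/nutch mergesegs crawl/merged_segments -dir crawl/segments

# What I was going to try next: the "multiple output segments" option,
# where -slice caps each output segment at N URLs, so the merge writes
# several smaller segments instead of one large one.
bin/nutch mergesegs crawl/merged_segments -dir crawl/segments -slice 50000
```

For the reduce-task count, I was planning to raise mapred.reduce.tasks in mapred-site.xml from 7 to a multiple of the number of nodes, on the theory that smaller per-task spills would keep any single local dir from filling up. Does that sound like the right lever, or is the local merge going to need the same total scratch space regardless?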
Many thanks in advance,
-yp