Hi,

We are frequently observing the following exception on our cluster:

java.io.IOException: DFSClient_attempt_201205232329_28133_r_000002_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_000002_0/part-r-00002.  Giving up.

The exception occurs while a file is being written.  We are using 
Hadoop 0.20.2.  It is a ~250-node cluster, and on average one box goes down 
every 3 days.

Detailed stack trace:
12/05/27 23:26:54 INFO mapred.JobClient: Task Id : attempt_201205232329_28133_r_000002_0, Status : FAILED
java.io.IOException: DFSClient_attempt_201205232329_28133_r_000002_0 could not complete file /output/tmp/test/_temporary/_attempt_201205232329_28133_r_000002_0/part-r-00002.  Giving up.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3331)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3240)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:61)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:86)
        at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:106)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:567)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)


Our investigation:
We have the minimum replication factor set to 2.  The document at 
http://kazman.shidler.hawaii.edu/ArchDocDecomposition.html says: "A call to 
complete() will not return true until all the file's blocks have been 
replicated the minimum number of times.  Thus, DataNode failures may cause a 
client to call complete() several times before succeeding".  So the client is 
expected to retry complete() several times.
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal() calls 
complete() and retries it up to 20 times.  In spite of that, the file's 
blocks are still not replicated the minimum number of times when the retries 
run out.  The retry count is not configurable.  Lowering the minimum 
replication factor to 1 is also not a good idea, since jobs run continuously 
on our cluster.
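
To make the behaviour described above concrete, here is a minimal, self-contained sketch of a bounded retry loop of the kind we believe closeInternal() performs.  It is not the actual DFSClient code: the class name CompleteRetrySketch, the method completeWithRetries, the sleepMillis parameter and the BooleanSupplier standing in for the NameNode's complete() call are all hypothetical names chosen for illustration.  Only the limit of 20 attempts and the "Giving up" message come from what we observed.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

// Illustrative only: models the bounded retry described above,
// not the real DFSClient implementation.
class CompleteRetrySketch {

    // Calls completeCall until it returns true or maxRetries attempts are used up.
    // Returns the number of attempts made; throws IOException when giving up.
    // sleepMillis is an assumed pause between attempts (hypothetical parameter).
    static int completeWithRetries(BooleanSupplier completeCall, int maxRetries, long sleepMillis)
            throws IOException {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            if (completeCall.getAsBoolean()) {
                return attempt; // all blocks reached the minimum replication
            }
            try {
                Thread.sleep(sleepMillis); // wait before asking again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IOException("Interrupted while waiting to complete file", e);
            }
        }
        throw new IOException("could not complete file after " + maxRetries
                + " attempts.  Giving up.");
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Hypothetical stand-in: the blocks reach minimum replication on the 3rd call.
        int attempts = completeWithRetries(() -> ++calls[0] >= 3, 20, 10L);
        System.out.println("completed after " + attempts + " attempts");
    }
}
```

If the stand-in never returns true, the loop ends with an IOException after 20 attempts, which matches what our reduce tasks are hitting.  Since the limit is fixed, the only knob we can see on our side is the minimum replication factor itself.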

Is there any solution or workaround for this problem?
What minimum replication factor is generally used in industry?

Let me know if any further input is required.

Thanks,
-Akshay