Lots of fetch failures
----------------------

                 Key: MAPREDUCE-1171
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1171
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: task
    Affects Versions: 0.20.1
            Reporter: Christian Kunz


Since we upgraded from hadoop-0.18.3 to hadoop-0.20.1, we see many more map 
task failures because of 'Too many fetch-failures'.

One of our jobs makes hardly any progress because 3000 reduces are unable to 
fetch the map output of 2 trailing maps (with about 80GB of output each), 
which are repeatedly marked as failed because the reduces cannot get their 
map output.
One difference from hadoop-0.18.3 seems to be that a reduce task now reports 
a failed map-output fetch after even a single attempt when the failure was a 
read error (cr.getError().equals(CopyOutputErrorType.READ_ERROR)). I do not 
think this is a good idea, as trailing map tasks will be hit by all reduces 
simultaneously.
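
For illustration, here is a minimal sketch of the reporting policy being 
described. This is plain Java, not the actual ReduceTask source: only the 
names CopyResult, CopyOutputErrorType.READ_ERROR, and 
MAX_FETCH_RETRIES_PER_MAP are echoed from the condition and log quoted in 
this report; the value of the retry budget and all scaffolding are invented 
assumptions.
{code:java}
// Sketch only: CopyOutputErrorType, CopyResult and MAX_FETCH_RETRIES_PER_MAP
// mirror names quoted in this report; the rest is invented scaffolding.
public class FetchFailurePolicy {

  enum CopyOutputErrorType { NO_ERROR, READ_ERROR, CONNECTION_ERROR }

  static class CopyResult {
    private final CopyOutputErrorType error;
    CopyResult(CopyOutputErrorType error) { this.error = error; }
    CopyOutputErrorType getError() { return error; }
  }

  // Hypothetical retry budget; the real constant lives inside ReduceTask.
  static final int MAX_FETCH_RETRIES_PER_MAP = 6;

  // Returns true if the failed fetch is reported to the JobTracker.
  // The 0.20.1 behavior complained about here: a READ_ERROR is reported
  // after the very first failed attempt, bypassing the retry budget.
  static boolean shouldReportFetchFailure(CopyResult cr, int failedAttempts) {
    return failedAttempts >= MAX_FETCH_RETRIES_PER_MAP
        || cr.getError().equals(CopyOutputErrorType.READ_ERROR);
  }

  public static void main(String[] args) {
    CopyResult readError = new CopyResult(CopyOutputErrorType.READ_ERROR);
    CopyResult connError = new CopyResult(CopyOutputErrorType.CONNECTION_ERROR);
    System.out.println(shouldReportFetchFailure(readError, 1)); // true: reported at once
    System.out.println(shouldReportFetchFailure(connError, 1)); // false: retries remain
  }
}
{code}
With 3000 reduces all hitting the same trailing map, a single timeout per 
reduce is enough to push the map over the fetch-failure threshold.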

Here is a log output of a reduce task:
{noformat}
2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200910281903_0028_r_000000_0 copy failed: attempt_200910281903_0028_m_002781_1 from some host
2009-10-29 21:38:36,148 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:687)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:632)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1064)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1496)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1377)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1289)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1220)

2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_200910281903_0028_r_000000_0: Failed fetch #1 from attempt_200910281903_0028_m_002781_1
2009-10-29 21:38:36,149 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_200910281903_0028_m_002781_1 even after MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to the JobTracker.
{noformat}

Also, I saw a few log messages which look suspicious, as if successfully 
fetched map output is discarded because the map was marked as failed (due to 
too many fetch failures). This would make the situation even worse; a sketch 
of this discard path follows the log below.

{noformat}
2009-10-29 22:07:28,729 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_200910281903_0028_m_001076_0, compressed len: 21882555, decompressed len: 23967845
2009-10-29 22:07:28,729 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 23967845 bytes (21882555 raw bytes) into RAM from attempt_200910281903_0028_m_001076_0
2009-10-29 22:07:43,602 INFO org.apache.hadoop.mapred.ReduceTask: Read 23967845 bytes from map-output for attempt_200910281903_0028_m_001076_0
2009-10-29 22:07:43,602 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200910281903_0028_m_001076_0 -> (20, 39772) from some host
...
2009-10-29 22:10:07,220 INFO org.apache.hadoop.mapred.ReduceTask: Ignoring obsolete output of FAILED map-task: 'attempt_200910281903_0028_m_001076_0'
{noformat}
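
As a hedged illustration of that discard path (invented scaffolding, not the 
actual shuffle code; only the "Ignoring obsolete output" message text is 
taken from the log above), the wasted work looks roughly like this:
{code:java}
// Sketch only, not the actual shuffle code: when the JobTracker declares a
// map attempt FAILED (e.g. after too many reported fetch failures), reduces
// drop output already copied from it and must re-fetch from the re-executed
// attempt -- wasted work if the copy they held was perfectly good.
import java.util.HashMap;
import java.util.Map;

public class ObsoleteOutputDiscard {

  // map attempt id -> shuffled output already held by this reduce
  private final Map<String, byte[]> copiedOutputs = new HashMap<>();

  void onCopySucceeded(String mapAttemptId, byte[] output) {
    copiedOutputs.put(mapAttemptId, output);
  }

  // Invoked when the JobTracker marks a map attempt as FAILED.
  void onMapAttemptFailed(String mapAttemptId) {
    if (copiedOutputs.remove(mapAttemptId) != null) {
      // Matches the message quoted in the log above.
      System.out.println("Ignoring obsolete output of FAILED map-task: '"
          + mapAttemptId + "'");
    }
  }
}
{code}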
