[ 
https://issues.apache.org/jira/browse/TEZ-3198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228415#comment-15228415
 ] 

Jason Lowe commented on TEZ-3198:
---------------------------------

bq. Would it be possible to share the task logs for the uncompleted task and 
the AM log?
It's kind of a pain to scrub the full logs for posting.  Here's the relevant 
portions of one of the task attempt logs relating to the errors it was 
receiving and the logic for deciding to fail itself:
{noformat}
2016-04-01 23:23:57,220 [INFO] [fetcher {scope_177} #36] 
|shuffle.HttpConnection|: for 
url=http://xxx:xxx/mapOutput?job=job_1458300907858_465665&reduce=857&map=attempt_1458300907858_465665_1_00_016212_0_10004,attempt_1458300907858_465665_1_00_013588_0_10004&keepAlive=true
 sent hash and receievd reply 0 ms
2016-04-01 23:23:57,220 [WARN] [fetcher {scope_177} #36] 
|orderedgrouped.FetcherOrderedGrouped|: Invalid map id: TTP/1.1 500 Internal 
Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, 
partition: 13
2016-04-01 23:23:57,221 [WARN] [fetcher {scope_177} #36] 
|orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks 
[InputAttemptIdentifier [inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1]]
2016-04-01 23:23:57,221 [INFO] [fetcher {scope_177} #36] 
|orderedgrouped.ShuffleScheduler|: scope_177: Reporting fetch failure for 
InputIdentifier: InputAttemptIdentifier [inputIdentifier=16212, 
attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1] taskAttemptIdentifier: scope-177_016212_00 to AM.
2016-04-01 23:23:57,224 [INFO] [fetcher {scope_177} #36] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1], numUniqueHosts=3578, hostFailureThreshold=716, 
hostFailuresCount=0, hosts crossing threshold=0, reducerFetchIssues=false
[...]
2016-04-01 23:24:00,930 [INFO] [fetcher {scope_177} #38] 
|shuffle.HttpConnection|: for 
url=http://xxx:xxx/mapOutput?job=job_1458300907858_465665&reduce=857&map=attempt_1458300907858_465665_1_00_013588_0_10004,attempt_1458300907858_465665_1_00_016212_0_10004&keepAlive=true
 sent hash and receievd reply 0 ms
2016-04-01 23:24:00,930 [WARN] [fetcher {scope_177} #38] 
|orderedgrouped.FetcherOrderedGrouped|: Invalid map id: TTP/1.1 500 Internal 
Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, 
partition: 13
[...]
2016-04-01 23:24:00,930 [WARN] [fetcher {scope_177} #38] 
|orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks 
[InputAttemptIdentifier [inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1]]
2016-04-01 23:24:00,931 [INFO] [fetcher {scope_177} #38] 
|orderedgrouped.ShuffleScheduler|: scope_177: Reporting fetch failure for 
InputIdentifier: InputAttemptIdentifier [inputIdentifier=13588, 
attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1] taskAttemptIdentifier: scope-177_013588_00 to AM.
2016-04-01 23:24:00,932 [INFO] [fetcher {scope_177} #38] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1], numUniqueHosts=3578, hostFailureThreshold=716, 
hostFailuresCount=1, hosts crossing threshold=0, reducerFetchIssues=false
[...]
2016-04-01 23:24:15,149 [INFO] [fetcher {scope_177} #32] 
|shuffle.HttpConnection|: for 
url=http://xxx:xxx/mapOutput?job=job_1458300907858_465665&reduce=857&map=attempt_1458300907858_465665_1_00_016212_0_10004,attempt_1458300907858_465665_1_00_013588_0_10004&keepAlive=true
 sent hash and receievd reply 0 ms
2016-04-01 23:24:15,150 [WARN] [fetcher {scope_177} #32] 
|orderedgrouped.FetcherOrderedGrouped|: Invalid map id: TTP/1.1 500 Internal 
Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, 
partition: 13
2016-04-01 23:24:15,150 [WARN] [fetcher {scope_177} #32] 
|orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks 
[InputAttemptIdentifier [inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1]]
2016-04-01 23:24:15,150 [INFO] [fetcher {scope_177} #32] 
|orderedgrouped.ShuffleScheduler|: scope_177: Reporting fetch failure for 
InputIdentifier: InputAttemptIdentifier [inputIdentifier=16212, 
attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1] taskAttemptIdentifier: scope-177_016212_00 to AM.
2016-04-01 23:24:15,152 [INFO] [fetcher {scope_177} #32] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1], numUniqueHosts=3578, hostFailureThreshold=716, 
hostFailuresCount=1, hosts crossing threshold=0, reducerFetchIssues=false
[...]
2016-04-01 23:24:18,539 [INFO] [fetcher {scope_177} #33] 
|shuffle.HttpConnection|: for 
url=http://xxx:xxx/mapOutput?job=job_1458300907858_465665&reduce=857&map=attempt_1458300907858_465665_1_00_013588_0_10004,attempt_1458300907858_465665_1_00_016212_0_10004&keepAlive=true
 sent hash and receievd reply 0 ms
2016-04-01 23:24:18,539 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: Invalid map id: TTP/1.1 500 Internal 
Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, 
partition: 13
2016-04-01 23:24:18,539 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks 
[InputAttemptIdentifier [inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1]]
2016-04-01 23:24:18,539 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: scope_177: Reporting fetch failure for 
InputIdentifier: InputAttemptIdentifier [inputIdentifier=13588, 
attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1] taskAttemptIdentifier: scope-177_013588_00 to AM.
2016-04-01 23:24:18,542 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1], numUniqueHosts=3578, hostFailureThreshold=716, 
hostFailuresCount=1, hosts crossing threshold=0, reducerFetchIssues=false
2016-04-01 23:24:21,924 [INFO] [fetcher {scope_177} #33] 
|shuffle.HttpConnection|: for 
url=http://xxx:xxx/mapOutput?job=job_1458300907858_465665&reduce=857&map=attempt_1458300907858_465665_1_00_016212_0_10004,attempt_1458300907858_465665_1_00_013588_0_10004&keepAlive=true
 sent hash and receievd reply 0 ms
2016-04-01 23:24:21,926 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: Invalid map id: TTP/1.1 500 Internal 
Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, 
partition: 13
2016-04-01 23:24:21,926 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks 
[InputAttemptIdentifier [inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1]]
2016-04-01 23:24:21,926 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: scope_177: Reporting fetch failure for 
InputIdentifier: InputAttemptIdentifier [inputIdentifier=16212, 
attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1] taskAttemptIdentifier: scope-177_016212_00 to AM.
2016-04-01 23:24:21,928 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1], numUniqueHosts=3578, hostFailureThreshold=716, 
hostFailuresCount=1, hosts crossing threshold=0, reducerFetchIssues=false
2016-04-01 23:24:26,325 [INFO] [fetcher {scope_177} #33] 
|shuffle.HttpConnection|: for 
url=http://xxx:xxx/mapOutput?job=job_1458300907858_465665&reduce=857&map=attempt_1458300907858_465665_1_00_013588_0_10004,attempt_1458300907858_465665_1_00_016212_0_10004&keepAlive=true
 sent hash and receievd reply 0 ms
2016-04-01 23:24:26,326 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: Invalid map id: TTP/1.1 500 Internal 
Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, 
partition: 13
2016-04-01 23:24:26,326 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks 
[InputAttemptIdentifier [inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1]]
2016-04-01 23:24:26,326 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: scope_177: Reporting fetch failure for 
InputIdentifier: InputAttemptIdentifier [inputIdentifier=13588, 
attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1] taskAttemptIdentifier: scope-177_013588_00 to AM.
2016-04-01 23:24:26,328 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1], numUniqueHosts=3578, hostFailureThreshold=716, 
hostFailuresCount=1, hosts crossing threshold=1, reducerFetchIssues=false
2016-04-01 23:24:30,724 [INFO] [fetcher {scope_177} #33] 
|shuffle.HttpConnection|: for 
url=http://xxx:xxx/mapOutput?job=job_1458300907858_465665&reduce=857&map=attempt_1458300907858_465665_1_00_016212_0_10004,attempt_1458300907858_465665_1_00_013588_0_10004&keepAlive=true
 sent hash and receievd reply 0 ms
2016-04-01 23:24:30,726 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: Invalid map id: TTP/1.1 500 Internal 
Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, 
partition: 13
2016-04-01 23:24:30,726 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks 
[InputAttemptIdentifier [inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1]]
2016-04-01 23:24:30,726 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: scope_177: Reporting fetch failure for 
InputIdentifier: InputAttemptIdentifier [inputIdentifier=16212, 
attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1] taskAttemptIdentifier: scope-177_016212_00 to AM.
2016-04-01 23:24:30,728 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1], numUniqueHosts=3578, hostFailureThreshold=716, 
hostFailuresCount=1, hosts crossing threshold=1, reducerFetchIssues=false
2016-04-01 23:24:36,442 [INFO] [fetcher {scope_177} #33] 
|shuffle.HttpConnection|: for 
url=http://xxx:xxx/mapOutput?job=job_1458300907858_465665&reduce=857&map=attempt_1458300907858_465665_1_00_013588_0_10004,attempt_1458300907858_465665_1_00_016212_0_10004&keepAlive=true
 sent hash and receievd reply 0 ms
2016-04-01 23:24:36,443 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: Invalid map id: TTP/1.1 500 Internal 
Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, 
partition: 13
2016-04-01 23:24:36,443 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks 
[InputAttemptIdentifier [inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1]]
2016-04-01 23:24:36,443 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: scope_177: Reporting fetch failure for 
InputIdentifier: InputAttemptIdentifier [inputIdentifier=13588, 
attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1] taskAttemptIdentifier: scope-177_013588_00 to AM.
2016-04-01 23:24:36,446 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1], numUniqueHosts=3578, hostFailureThreshold=716, 
hostFailuresCount=1, hosts crossing threshold=1, reducerFetchIssues=false
2016-04-01 23:24:42,163 [INFO] [fetcher {scope_177} #33] 
|shuffle.HttpConnection|: for 
url=http://xxx:xxx/mapOutput?job=job_1458300907858_465665&reduce=857&map=attempt_1458300907858_465665_1_00_016212_0_10004,attempt_1458300907858_465665_1_00_013588_0_10004&keepAlive=true
 sent hash and receievd reply 0 ms
2016-04-01 23:24:42,163 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: Invalid map id: TTP/1.1 500 Internal 
Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, 
partition: 13
2016-04-01 23:24:42,163 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks 
[InputAttemptIdentifier [inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1]]
2016-04-01 23:24:42,163 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: scope_177: Reporting fetch failure for 
InputIdentifier: InputAttemptIdentifier [inputIdentifier=16212, 
attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1] taskAttemptIdentifier: scope-177_016212_00 to AM.
2016-04-01 23:24:42,165 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1], numUniqueHosts=3578, hostFailureThreshold=716, 
hostFailuresCount=1, hosts crossing threshold=1, reducerFetchIssues=false
2016-04-01 23:24:49,593 [INFO] [fetcher {scope_177} #33] 
|shuffle.HttpConnection|: for 
url=http://xxx:xxx/mapOutput?job=job_1458300907858_465665&reduce=857&map=attempt_1458300907858_465665_1_00_013588_0_10004,attempt_1458300907858_465665_1_00_016212_0_10004&keepAlive=true
 sent hash and receievd reply 0 ms
2016-04-01 23:24:49,593 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: Invalid map id: TTP/1.1 500 Internal 
Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, 
partition: 13
2016-04-01 23:24:49,593 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks 
[InputAttemptIdentifier [inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1]]
2016-04-01 23:24:49,593 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: scope_177: Reporting fetch failure for 
InputIdentifier: InputAttemptIdentifier [inputIdentifier=13588, 
attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1] taskAttemptIdentifier: scope-177_013588_00 to AM.
2016-04-01 23:24:49,596 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1], numUniqueHosts=3578, hostFailureThreshold=716, 
hostFailuresCount=1, hosts crossing threshold=1, reducerFetchIssues=false
2016-04-01 23:24:57,024 [INFO] [fetcher {scope_177} #33] 
|shuffle.HttpConnection|: for 
url=http://xxx:xxx/mapOutput?job=job_1458300907858_465665&reduce=857&map=attempt_1458300907858_465665_1_00_016212_0_10004,attempt_1458300907858_465665_1_00_013588_0_10004&keepAlive=true
 sent hash and receievd reply 0 ms
2016-04-01 23:24:57,025 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: Invalid map id: TTP/1.1 500 Internal 
Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, 
partition: 13
2016-04-01 23:24:57,025 [WARN] [fetcher {scope_177} #33] 
|orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks 
[InputAttemptIdentifier [inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1]]
2016-04-01 23:24:57,026 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: scope_177: Reporting fetch failure for 
InputIdentifier: InputAttemptIdentifier [inputIdentifier=16212, 
attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1] taskAttemptIdentifier: scope-177_016212_00 to AM.
2016-04-01 23:24:57,029 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1], numUniqueHosts=3578, hostFailureThreshold=716, 
hostFailuresCount=1, hosts crossing threshold=1, reducerFetchIssues=false
2016-04-01 23:24:57,029 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=16212, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_016212_0_10004, spillType=0, 
spillId=-1], fetcherHealthy=false, failedShufflesSinceLastCompletion=8, 
remainingMaps=2
2016-04-01 23:24:57,029 [ERROR] [fetcher {scope_177} #33] 
|orderedgrouped.ShuffleScheduler|: scope_177: Shuffle failed with too many 
fetch failures and insufficient progress!failureCounts=2, pendingInputs=2, 
fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
2016-04-01 23:24:57,029 [INFO] [fetcher {scope_177} #33] 
|orderedgrouped.Shuffle|: scope_177: Setting throwable in reportException with 
message [scope_177: Shuffle failed with too many fetch failures and 
insufficient progress!failureCounts=2, pendingInputs=2, fetcherHealthy=false, 
reducerProgressedEnough=true, reducerStalled=true] from thread [fetcher 
{scope_177} #33
2016-04-01 23:24:57,032 [ERROR] [ShuffleAndMergeRunner {scope_177}] 
|orderedgrouped.Shuffle|: scope_177: ShuffleRunner failed with error
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
 error in shuffle in fetcher {scope_177} #33
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:360)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:336)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: scope_177: Shuffle failed with too many fetch 
failures and insufficient progress!failureCounts=2, pendingInputs=2, 
fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:806)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:547)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:321)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)
2016-04-01 23:24:57,032 [INFO] [fetcher {scope_177} #45] 
|shuffle.HttpConnection|: for 
url=http://xxx:xxx/mapOutput?job=job_1458300907858_465665&reduce=857&map=attempt_1458300907858_465665_1_00_013588_0_10004,attempt_1458300907858_465665_1_00_016212_0_10004&keepAlive=true
 sent hash and receievd reply 0 ms
2016-04-01 23:24:57,032 [ERROR] [TezChild] |runtime.PigProcessor|: Encountered 
exception while processing: 
org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
 error in shuffle in fetcher {scope_177} #33
        at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POShuffleTezLoad.attachInputs(POShuffleTezLoad.java:121)
        at 
org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.initializeInputs(PigProcessor.java:332)
        at 
org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:210)
        at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
        at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
        at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
        at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
        at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
 error in shuffle in fetcher {scope_177} #33
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:360)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:336)
        ... 5 more
Caused by: java.io.IOException: scope_177: Shuffle failed with too many fetch 
failures and insufficient progress!failureCounts=2, pendingInputs=2, 
fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:806)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:547)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:321)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)
2016-04-01 23:24:57,035 [WARN] [fetcher {scope_177} #45] 
|orderedgrouped.FetcherOrderedGrouped|: Invalid map id: TTP/1.1 500 Internal 
Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, 
partition: 13
2016-04-01 23:24:57,035 [WARN] [fetcher {scope_177} #45] 
|orderedgrouped.FetcherOrderedGrouped|: copyMapOutput failed for tasks 
[InputAttemptIdentifier [inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1]]
2016-04-01 23:24:57,035 [INFO] [TezChild] |task.TezTaskRunner|: Encounted an 
error while executing task: attempt_1458300907858_465665_1_02_000857_1
org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
 error in shuffle in fetcher {scope_177} #33
        at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POShuffleTezLoad.attachInputs(POShuffleTezLoad.java:121)
        at 
org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.initializeInputs(PigProcessor.java:332)
        at 
org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:210)
        at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:347)
        at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:179)
        at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:171)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1679)
        at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:171)
        at 
org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:167)
        at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
 error in shuffle in fetcher {scope_177} #33
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:360)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:336)
        ... 5 more
Caused by: java.io.IOException: scope_177: Shuffle failed with too many fetch 
failures and insufficient progress!failureCounts=2, pendingInputs=2, 
fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:806)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:547)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:321)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:176)
        at 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.run(FetcherOrderedGrouped.java:191)
2016-04-01 23:24:57,036 [INFO] [fetcher {scope_177} #45] 
|orderedgrouped.ShuffleScheduler|: scope_177: Reporting fetch failure for 
InputIdentifier: InputAttemptIdentifier [inputIdentifier=13588, 
attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1] taskAttemptIdentifier: scope-177_013588_00 to AM.
2016-04-01 23:24:57,037 [INFO] [ShuffleAndMergeRunner {scope_177}] 
|orderedgrouped.ShuffleInputEventHandlerOrderedGrouped|: scope-177: 
numDmeEventsSeen=17756, numDmeEventsSeenWithNoData=5, 
numObsoletionEventsSeen=7, updateOnClose
2016-04-01 23:24:57,039 [INFO] [fetcher {scope_177} #45] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1], numUniqueHosts=3578, hostFailureThreshold=716, 
hostFailuresCount=1, hosts crossing threshold=1, reducerFetchIssues=false
2016-04-01 23:24:57,039 [INFO] [fetcher {scope_177} #45] 
|orderedgrouped.ShuffleScheduler|: srcAttempt=InputAttemptIdentifier 
[inputIdentifier=13588, attemptNumber=0, 
pathComponent=attempt_1458300907858_465665_1_00_013588_0_10004, spillType=0, 
spillId=-1], fetcherHealthy=false, failedShufflesSinceLastCompletion=9, 
remainingMaps=2
2016-04-01 23:24:57,039 [ERROR] [fetcher {scope_177} #45] 
|orderedgrouped.ShuffleScheduler|: scope_177: Shuffle failed with too many 
fetch failures and insufficient progress!failureCounts=2, pendingInputs=2, 
fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true
{noformat}

So we can see from the log that the task failed after only a minute of retries, 
and that's not enough to trigger the 5 minute error threshold in the AM to 
retry the upstream task.  The remaining two task attempts did the same thing.  
The first attempt ran successfully but was retroactively failed due to reported 
fetch failures from downstream, and that explains why this task was the only 
one of its vertex trying to shuffle at the time.

Here's the relevant portion of the AM log for attempts beyond attempt 0.  It's 
not very interesting since it's mostly just logging attempts blaming the 
upstream task and not taking any subsequent action.  Since we aggregate errors 
from an attempt it doesn't matter how many errors a single attempt is reporting 
(except for the 5-minute duration check).  And since this was only 4 attempts 
complaining out of hundreds of tasks for the vertex they couldn't collectively 
breach the failure fraction threshold.  Therefore the only hope was to get past 
the 5-minute limit, but each task attempt decided to fail itself before 
reaching that threshold.  Note that the task speculated which is why multiple 
attempts were running at the same time.
{noformat}
2016-04-01 23:23:51,615 [INFO] [Dispatcher thread {Speculator}] 
|legacy.LegacySpeculator|: DefaultSpeculator.addSpeculativeAttempt -- we are 
speculating task_1458300907858_465665_1_02_000857
2016-04-01 23:23:51,615 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: 
Scheduling a redundant attempt for task task_1458300907858_465665_1_02_000857
2016-04-01 23:23:51,615 [INFO] [TaskSchedulerEventHandlerThread] 
|rm.YarnTaskSchedulerService|: Allocation request for task: 
attempt_1458300907858_465665_1_02_000857_2 with request: 
Capability[<memory:4096, vCores:10>]Priority[7] host: null rack: null
2016-04-01 23:23:52,580 [INFO] [IPC Server handler 14 on 49422] 
|impl.VertexImpl|: Sending attempt_1458300907858_465665_1_02_000857_1 264 
events [17507,17771) total 17771 vertex_1458300907858_465665_1_02 [scope-191]
2016-04-01 23:23:58,692 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_1 at inputIndex 16212
2016-04-01 23:24:01,694 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_1 at inputIndex 13588
2016-04-01 23:24:02,883 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_2 at inputIndex 16212
2016-04-01 23:24:16,729 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_1 at inputIndex 16212
2016-04-01 23:24:17,893 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_2 at inputIndex 13588
2016-04-01 23:24:19,730 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_1 at inputIndex 13588
2016-04-01 23:24:20,895 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_2 at inputIndex 16212
2016-04-01 23:24:22,732 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_1 at inputIndex 16212
2016-04-01 23:24:23,898 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_2 at inputIndex 13588
2016-04-01 23:24:28,743 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_1 at inputIndex 13588
2016-04-01 23:24:29,900 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_2 at inputIndex 16212
2016-04-01 23:24:31,745 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_1 at inputIndex 16212
2016-04-01 23:24:32,903 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_2 at inputIndex 13588
2016-04-01 23:24:35,905 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_2 at inputIndex 16212
2016-04-01 23:24:37,748 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_1 at inputIndex 13588
2016-04-01 23:24:43,786 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_1 at inputIndex 16212
2016-04-01 23:24:47,911 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_2 at inputIndex 16212
2016-04-01 23:24:49,789 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_1 at inputIndex 13588
2016-04-01 23:24:54,656 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_2 at inputIndex 13588
2016-04-01 23:24:54,656 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_2 at inputIndex 16212
2016-04-01 23:24:54,656 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_2 at inputIndex 13588
2016-04-01 23:24:54,657 [INFO] [Dispatcher thread {Central}] 
|history.HistoryEventHandler|: 
[HISTORY][DAG:dag_1458300907858_465665_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=scope-191, taskAttemptId=attempt_1458300907858_465665_1_02_000857_2, 
creationTime=1459553031615, allocationTime=1459553031615, 
startTime=1459553031622, finishTime=1459553094656, timeTaken=63034, 
status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while 
running task:org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
 error in shuffle in fetcher {scope_177} #58
2016-04-01 23:24:54,657 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
Source task attempt completed for vertex: vertex_1458300907858_465665_1_03 
[scope-194] attempt: attempt_1458300907858_465665_1_02_000857_2 with state: 
FAILED vertexState: RUNNING
2016-04-01 23:24:55,792 [INFO] [Dispatcher thread {Speculator}] 
|legacy.LegacySpeculator|: DefaultSpeculator.addSpeculativeAttempt -- we are 
speculating task_1458300907858_465665_1_02_000857
2016-04-01 23:24:55,793 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: 
Scheduling a redundant attempt for task task_1458300907858_465665_1_02_000857
2016-04-01 23:24:55,793 [INFO] [TaskSchedulerEventHandlerThread] 
|rm.YarnTaskSchedulerService|: Allocation request for task: 
attempt_1458300907858_465665_1_02_000857_3 with request: 
Capability[<memory:4096, vCores:10>]Priority[7] host: null rack: null
22016-04-01 23:24:57,057 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_1 at inputIndex 16212
2016-04-01 23:24:57,057 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_1 at inputIndex 13588
2016-04-01 23:24:57,057 [INFO] [Dispatcher thread {Central}] 
|history.HistoryEventHandler|: 
[HISTORY][DAG:dag_1458300907858_465665_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=scope-191, taskAttemptId=attempt_1458300907858_465665_1_02_000857_1, 
creationTime=1459553030536, allocationTime=1459553030538, 
startTime=1459553030571, finishTime=1459553097057, timeTaken=66486, 
status=FAILED, errorEnum=FRAMEWORK_ERROR, diagnostics=Error: Failure while 
running task:org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
 error in shuffle in fetcher {scope_177} #33
2016-04-01 23:24:57,057 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
Source task attempt completed for vertex: vertex_1458300907858_465665_1_03 
[scope-194] attempt: attempt_1458300907858_465665_1_02_000857_1 with state: 
FAILED vertexState: RUNNING
[...]
2016-04-01 23:25:00,550 [INFO] [IPC Server handler 4 on 49422] 
|impl.VertexImpl|: Sending attempt_1458300907858_465665_1_02_000857_3 268 
events [17503,17771) total 17771 vertex_1458300907858_465665_1_02 [scope-191]
2016-04-01 23:25:12,601 [INFO] [Dispatcher thread {Speculator}] 
|legacy.LegacySpeculator|: DefaultSpeculator.addSpeculativeAttempt -- we are 
speculating task_1458300907858_465665_1_02_000857
2016-04-01 23:25:12,601 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: 
Scheduling a redundant attempt for task task_1458300907858_465665_1_02_000857
2016-04-01 23:25:12,601 [INFO] [TaskSchedulerEventHandlerThread] 
|rm.YarnTaskSchedulerService|: Allocation request for task: 
attempt_1458300907858_465665_1_02_000857_4 with request: 
Capability[<memory:4096, vCores:10>]Priority[7] host: null rack: null
2016-04-01 23:25:16,949 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_4 at inputIndex 16212
2016-04-01 23:25:18,608 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:18,608 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:24,613 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:27,615 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:30,629 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:31,971 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_4 at inputIndex 13588
2016-04-01 23:25:31,971 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_4 at inputIndex 16212
2016-04-01 23:25:36,632 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:37,974 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_4 at inputIndex 13588
2016-04-01 23:25:40,975 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_4 at inputIndex 16212
2016-04-01 23:25:42,635 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:43,977 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_4 at inputIndex 13588
2016-04-01 23:25:45,637 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:49,980 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_4 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_016212_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 16212
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|impl.TaskAttemptImpl|: attempt_1458300907858_465665_1_00_013588_0 blamed for 
read error from attempt_1458300907858_465665_1_02_000857_3 at inputIndex 13588
2016-04-01 23:25:53,211 [INFO] [Dispatcher thread {Central}] 
|history.HistoryEventHandler|: 
[HISTORY][DAG:dag_1458300907858_465665_1][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=scope-191, taskAttemptId=attempt_1458300907858_465665_1_02_000857_3, 
creationTime=1459553095793, allocationTime=1459553095793, 
startTime=1459553095945, finishTime=1459553153211, timeTaken=57266, 
status=FAILED, errorEnum=INPUT_READ_ERROR, diagnostics=Error: 
exceptionThrown=org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
 error in shuffle in fetcher {scope_177} #31
2016-04-01 23:25:53,212 [INFO] [Dispatcher thread {Central}] |impl.TaskImpl|: 
Failing task: task_1458300907858_465665_1_02_000857, currentFailedAttempts: 4, 
maxFailedAttempts: 4
2016-04-01 23:25:53,212 [INFO] [Dispatcher thread {Central}] 
|history.HistoryEventHandler|: 
[HISTORY][DAG:dag_1458300907858_465665_1][Event:TASK_FINISHED]: 
vertexName=scope-191, taskId=task_1458300907858_465665_1_02_000857, 
startTime=1459545475064, finishTime=1459553153212, timeTaken=7678148, 
status=FAILED, successfulAttemptID=null, diagnostics=TaskAttempt 0 failed, 
info=[attempt_1458300907858_465665_1_02_000857_0 being failed for too many 
output errors. failureFraction=0.05, MAX_ALLOWED_OUTPUT_FAILURES_FRACTION=0.1, 
uniquefailedOutputReports=10, MAX_ALLOWED_OUTPUT_FAILURES=10, 
MAX_ALLOWED_TIME_FOR_TASK_READ_ERROR_SEC=300, readErrorTimespan=0]
2016-04-01 23:25:53,212 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: 
Failing vertex: vertex_1458300907858_465665_1_02 [scope-191] because task 
failed: task_1458300907858_465665_1_02_000857
2016-04-01 23:25:53,271 [INFO] [Dispatcher thread {Central}] 
|history.HistoryEventHandler|: 
[HISTORY][DAG:dag_1458300907858_465665_1][Event:VERTEX_FINISHED]: 
vertexName=scope-191, vertexId=vertex_1458300907858_465665_1_02, 
initRequestedTime=1459544441828, initedTime=1459544441883, 
startRequestedTime=1459544442182, startedTime=1459544442183, 
finishTime=1459553153258, timeTaken=8711075, status=FAILED, diagnostics=Task 
failed, taskId=task_1458300907858_465665_1_02_000857, diagnostics=[TaskAttempt 
0 failed, info=[attempt_1458300907858_465665_1_02_000857_0 being failed for too 
many output errors. failureFraction=0.05, 
MAX_ALLOWED_OUTPUT_FAILURES_FRACTION=0.1, uniquefailedOutputReports=10, 
MAX_ALLOWED_OUTPUT_FAILURES=10, MAX_ALLOWED_TIME_FOR_TASK_READ_ERROR_SEC=300, 
readErrorTimespan=0], TaskAttempt 2 failed, info=[Error: Failure while running 
task:org.apache.pig.backend.executionengine.ExecException: ERROR 0: 
org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
 error in shuffle in fetcher {scope_177} #58
{noformat}

bq. Can you please try the job with 
"tez.runtime.shuffle.failed.check.since-last.completion=false"
Since this was a transient shuffle error a single re-run with this setting is 
unlikely to show anything useful unless the same shuffle problem occurs again.  
Instead we're planning on setting that property to false across our jobs to see 
if the shuffle-fetch-failure-kills-job rate decreases significantly.

> Shuffle failures for the trailing task in a vertex are often fatal to the 
> entire DAG
> ------------------------------------------------------------------------------------
>
>                 Key: TEZ-3198
>                 URL: https://issues.apache.org/jira/browse/TEZ-3198
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.0, 0.8.2
>            Reporter: Jason Lowe
>            Priority: Critical
>
> I've seen an increasing number of cases where a single-node failure caused 
> the whole Tez DAG to fail. These scenarios are common in that they involve 
> the last task of a vertex attempting to complete a shuffle where all the peer 
> tasks have already finished shuffling.  The last task's attempt encounters 
> errors shuffling one of its inputs and keeps reporting it to the AM.  
> Eventually the attempt decides it must be the cause of the shuffle error and 
> fails.  The subsequent attempts all do the same thing, and eventually we hit 
> the task max attempts limit and fail the vertex and DAG.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to