[
https://issues.apache.org/jira/browse/TEZ-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
László Bodor resolved TEZ-4336.
-------------------------------
Resolution: Fixed
> ShuffleScheduler should try to report the original exception (when shuffle
> becomes unhealthy)
> ---------------------------------------------------------------------------------------------
>
> Key: TEZ-4336
> URL: https://issues.apache.org/jira/browse/TEZ-4336
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: László Bodor
> Assignee: László Bodor
> Priority: Major
> Fix For: 0.10.2
>
> Attachments: TEZ_4336_client_output.txt
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> In a client log, I can see something like:
> {code}
> ERROR : FAILED: Execution Error, return code 2 from
> org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map
> 1, vertexId=vertex_1632183109176_0005_8_03Vertex re-running, vertexName=Map
> 2, vertexId=vertex_1632183109176_0005_8_04Vertex failed, vertexName=Reducer
> 3, vertexId=vertex_1632183109176_0005_8_05, diagnostics=[Task failed,
> taskId=task_1632183109176_0005_8_05_000032, diagnostics=[TaskAttempt 0
> killed, TaskAttempt 1 killed, TaskAttempt 2 killed, TaskAttempt 3 killed,
> TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt
> 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed,
> TaskAttempt 11 killed, TaskAttempt 12 killed, TaskAttempt 13 failed,
> info=[AttemptID:attempt_1632183109176_0005_8_05_000032_13 Timed out after 300
> secs], TaskAttempt 14 killed, TaskAttempt 15 killed, TaskAttempt 16 killed,
> TaskAttempt 17 killed, TaskAttempt 18 killed, TaskAttempt 19 killed,
> TaskAttempt 20 killed, TaskAttempt 21 killed, TaskAttempt 22 killed,
> TaskAttempt 23 killed, TaskAttempt 24 killed, TaskAttempt 25 killed,
> TaskAttempt 26 killed, TaskAttempt 27 killed, TaskAttempt 28 killed,
> TaskAttempt 29 killed, TaskAttempt 30 killed, TaskAttempt 31 killed,
> TaskAttempt 32 killed, TaskAttempt 33 killed, TaskAttempt 34 killed,
> TaskAttempt 35 killed, TaskAttempt 36 killed, TaskAttempt 37 killed,
> TaskAttempt 38 killed, TaskAttempt 39 killed, TaskAttempt 40 killed,
> TaskAttempt 41 killed, TaskAttempt 42 killed, TaskAttempt 43 killed,
> TaskAttempt 44 killed, TaskAttempt 45 killed, TaskAttempt 46 killed,
> TaskAttempt 47 killed, TaskAttempt 48 killed, TaskAttempt 49 killed,
> TaskAttempt 50 killed, TaskAttempt 51 killed, TaskAttempt 52 killed,
> TaskAttempt 53 killed, TaskAttempt 54 killed, TaskAttempt 55 killed,
> TaskAttempt 56 killed, TaskAttempt 57 killed, TaskAttempt 58 killed,
> TaskAttempt 59 killed, TaskAttempt 60 killed, TaskAttempt 61 killed,
> TaskAttempt 62 killed, TaskAttempt 63 killed, TaskAttempt 64 killed,
> TaskAttempt 65 killed, TaskAttempt 66 killed, TaskAttempt 67 killed,
> TaskAttempt 68 killed, TaskAttempt 69 killed, TaskAttempt 70 killed,
> TaskAttempt 71 killed, TaskAttempt 72 killed, TaskAttempt 73 killed,
> TaskAttempt 74 killed, TaskAttempt 75 killed, TaskAttempt 76 killed,
> TaskAttempt 77 killed, TaskAttempt 78 killed, TaskAttempt 79 killed,
> TaskAttempt 80 killed, TaskAttempt 81 killed, TaskAttempt 82 killed,
> TaskAttempt 83 killed, TaskAttempt 84 killed, TaskAttempt 85 killed,
> TaskAttempt 86 killed, TaskAttempt 87 killed, TaskAttempt 88 killed,
> TaskAttempt 89 killed, TaskAttempt 90 killed, TaskAttempt 91 killed,
> TaskAttempt 92 killed, TaskAttempt 93 killed, TaskAttempt 94 killed,
> TaskAttempt 95 killed, TaskAttempt 96 failed, info=[Error: Error while
> running task ( failure ) :
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
> error in shuffle in Fetcher_O {Map_2} #13
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:306)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:288)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
> at
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
> at
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Map_2: Shuffle failed with too many fetch
> failures and insufficient progress!failureCounts=14, pendingInputs=4130,
> fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1055)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:793)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:392)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:265)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:184)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:196)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:59)
> ... 7 more
> , errorMessage=Shuffle Runner
> Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
> error in shuffle in Fetcher_O {Map_2} #13
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:306)
> at
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:288)
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> at
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
> at
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
> at
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> The message "Shuffle failed with too many fetch failures and insufficient
> progress!failureCounts=14" means that the underlying exception wasn't
> reported, only the shuffle failure. It would be good to have some details
> here; isShuffleHealthy simply creates an exception:
> https://github.com/apache/tez/blob/5eeccf0e318e22cdcbbe202a9f554f93d138c207/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleScheduler.java#L1059
> What if we stored the last exception (usually, most of them have the same
> root cause) and wrapped it somehow in this IOException?
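
A minimal sketch of the idea in the description, not the actual Tez change: keep a reference to the most recent fetch failure and attach it as the cause when the "shuffle unhealthy" IOException is created. The class ShuffleHealthSketch and its methods recordFetchFailure and unhealthyException are hypothetical names used only for illustration; the real ShuffleScheduler has different fields and signatures.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical illustration only; this is not the ShuffleScheduler code. */
class ShuffleHealthSketch {
  private final String srcName;

  // Most recent fetch failure reported by any fetcher thread (assumed to be
  // representative, since most failures tend to share one root cause).
  private final AtomicReference<Throwable> lastFailure = new AtomicReference<>();

  ShuffleHealthSketch(String srcName) {
    this.srcName = srcName;
  }

  /** Would be called from a copyFailed-like path with the original exception. */
  void recordFetchFailure(Throwable cause) {
    if (cause != null) {
      lastFailure.set(cause);
    }
  }

  /** Builds the "unhealthy" exception and chains the last failure, if any, as its cause. */
  IOException unhealthyException(int failureCounts, int pendingInputs) {
    String msg = srcName
        + ": Shuffle failed with too many fetch failures and insufficient progress!"
        + "failureCounts=" + failureCounts
        + ", pendingInputs=" + pendingInputs;
    // IOException(String, Throwable) accepts a null cause, so this also works
    // when no failure has been recorded yet.
    return new IOException(msg, lastFailure.get());
  }

  public static void main(String[] args) {
    ShuffleHealthSketch sketch = new ShuffleHealthSketch("Map_2");
    sketch.recordFetchFailure(new ConnectException("Connection refused"));
    IOException e = sketch.unhealthyException(14, 4130);
    // The stack trace now ends with a "Caused by:" section for the original failure.
    e.printStackTrace();
  }
}
```

With the cause chained this way, the client-side diagnostics would carry a "Caused by:" entry for the original fetch failure in addition to the generic shuffle message. Keeping only the last exception is a deliberate simplification; it relies on the assumption stated above that most fetch failures share the same root cause.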
--
This message was sent by Atlassian Jira
(v8.3.4#803005)