[jira] [Resolved] (TEZ-4336) ShuffleScheduler should try to report the original exception (when shuffle becomes unhealthy)

Jira Tue, 02 Nov 2021 10:28:07 -0700


     [ 
https://issues.apache.org/jira/browse/TEZ-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


László Bodor resolved TEZ-4336.
-------------------------------
    Resolution: Fixed

> ShuffleScheduler should try to report the original exception (when shuffle 
> becomes unhealthy)
> ---------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4336
>                 URL: https://issues.apache.org/jira/browse/TEZ-4336
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.2
>
>         Attachments: TEZ_4336_client_output.txt
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In a client log, I can see something like:
> {code}
> ERROR : FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 
> 1, vertexId=vertex_1632183109176_0005_8_03Vertex re-running, vertexName=Map 
> 2, vertexId=vertex_1632183109176_0005_8_04Vertex failed, vertexName=Reducer 
> 3, vertexId=vertex_1632183109176_0005_8_05, diagnostics=[Task failed, 
> taskId=task_1632183109176_0005_8_05_000032, diagnostics=[TaskAttempt 0 
> killed, TaskAttempt 1 killed, TaskAttempt 2 killed, TaskAttempt 3 killed, 
> TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 
> 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, 
> TaskAttempt 11 killed, TaskAttempt 12 killed, TaskAttempt 13 failed, 
> info=[AttemptID:attempt_1632183109176_0005_8_05_000032_13 Timed out after 300 
> secs], TaskAttempt 14 killed, TaskAttempt 15 killed, TaskAttempt 16 killed, 
> TaskAttempt 17 killed, TaskAttempt 18 killed, TaskAttempt 19 killed, 
> TaskAttempt 20 killed, TaskAttempt 21 killed, TaskAttempt 22 killed, 
> TaskAttempt 23 killed, TaskAttempt 24 killed, TaskAttempt 25 killed, 
> TaskAttempt 26 killed, TaskAttempt 27 killed, TaskAttempt 28 killed, 
> TaskAttempt 29 killed, TaskAttempt 30 killed, TaskAttempt 31 killed, 
> TaskAttempt 32 killed, TaskAttempt 33 killed, TaskAttempt 34 killed, 
> TaskAttempt 35 killed, TaskAttempt 36 killed, TaskAttempt 37 killed, 
> TaskAttempt 38 killed, TaskAttempt 39 killed, TaskAttempt 40 killed, 
> TaskAttempt 41 killed, TaskAttempt 42 killed, TaskAttempt 43 killed, 
> TaskAttempt 44 killed, TaskAttempt 45 killed, TaskAttempt 46 killed, 
> TaskAttempt 47 killed, TaskAttempt 48 killed, TaskAttempt 49 killed, 
> TaskAttempt 50 killed, TaskAttempt 51 killed, TaskAttempt 52 killed, 
> TaskAttempt 53 killed, TaskAttempt 54 killed, TaskAttempt 55 killed, 
> TaskAttempt 56 killed, TaskAttempt 57 killed, TaskAttempt 58 killed, 
> TaskAttempt 59 killed, TaskAttempt 60 killed, TaskAttempt 61 killed, 
> TaskAttempt 62 killed, TaskAttempt 63 killed, TaskAttempt 64 killed, 
> TaskAttempt 65 killed, TaskAttempt 66 killed, TaskAttempt 67 killed, 
> TaskAttempt 68 killed, TaskAttempt 69 killed, TaskAttempt 70 killed, 
> TaskAttempt 71 killed, TaskAttempt 72 killed, TaskAttempt 73 killed, 
> TaskAttempt 74 killed, TaskAttempt 75 killed, TaskAttempt 76 killed, 
> TaskAttempt 77 killed, TaskAttempt 78 killed, TaskAttempt 79 killed, 
> TaskAttempt 80 killed, TaskAttempt 81 killed, TaskAttempt 82 killed, 
> TaskAttempt 83 killed, TaskAttempt 84 killed, TaskAttempt 85 killed, 
> TaskAttempt 86 killed, TaskAttempt 87 killed, TaskAttempt 88 killed, 
> TaskAttempt 89 killed, TaskAttempt 90 killed, TaskAttempt 91 killed, 
> TaskAttempt 92 killed, TaskAttempt 93 killed, TaskAttempt 94 killed, 
> TaskAttempt 95 killed, TaskAttempt 96 failed, info=[Error: Error while 
> running task ( failure ) : 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
>  error in shuffle in Fetcher_O {Map_2} #13
>       at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:306)
>       at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:288)
>       at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>       at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
>       at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
>       at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Map_2: Shuffle failed with too many fetch 
> failures and insufficient progress!failureCounts=14, pendingInputs=4130, 
> fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false
>       at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1055)
>       at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:793)
>       at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:392)
>       at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:265)
>       at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:184)
>       at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:196)
>       at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:59)
>       ... 7 more
> , errorMessage=Shuffle Runner 
> Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError:
>  error in shuffle in Fetcher_O {Map_2} #13
>       at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:306)
>       at 
> org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:288)
>       at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>       at 
> com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
>       at 
> com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
>       at 
> com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
> "Shuffle failed with too many fetch failures and insufficient 
> progress!failureCounts=14" means that only the shuffle failure was reported, 
> not the underlying exception. It would be good to have some details here. 
> isShuffleHealthy simply creates a new exception:
> https://github.com/apache/tez/blob/5eeccf0e318e22cdcbbe202a9f554f93d138c207/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleScheduler.java#L1059
> What if we stored the last exception (usually, most of them have the same 
> root cause) and wrapped it somehow into this IOException?
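
The idea in the description (keep the last fetch exception and attach it as the cause of the IOException) could look roughly like the sketch below. This is only an illustration under stated assumptions: ShuffleHealthSketch, its fields, and the simplified threshold check are hypothetical stand-ins, not the actual ShuffleScheduler code or the committed fix. The one real API relied on is the java.io.IOException(String, Throwable) constructor.

```java
import java.io.IOException;

// Hypothetical stand-in for the scheduler's failure bookkeeping; NOT the real Tez class.
class ShuffleHealthSketch {
  private final int maxFailures;      // assumed threshold, standing in for the real health checks
  private int failureCount;
  private Throwable lastFetchFailure; // most recent exception reported by a fetcher

  ShuffleHealthSketch(int maxFailures) {
    this.maxFailures = maxFailures;
  }

  // Called when a fetch fails; remembers the cause so it can be reported later.
  synchronized void copyFailed(Throwable cause) throws IOException {
    failureCount++;
    if (cause != null) {
      lastFetchFailure = cause;
    }
    if (failureCount >= maxFailures) {
      // Wrap the last underlying exception instead of throwing a bare IOException,
      // so a "Caused by:" section with the original error shows up in the diagnostics.
      throw new IOException(
          "Shuffle failed with too many fetch failures and insufficient progress!"
              + "failureCounts=" + failureCount,
          lastFetchFailure);
    }
  }
}
```

With something like this, the stack trace reported to the client would end in a "Caused by:" entry for the last fetch error (for example a connection or timeout exception) rather than stopping at the generic shuffle failure.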



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
