[
https://issues.apache.org/jira/browse/FLINK-33565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788697#comment-17788697
]
Matthias Pohl commented on FLINK-33565:
---------------------------------------
{quote}
I don't understand why concurrent exceptions should happen when using the
AdaptiveScheduler. When one job only has all-to-all shuffle, AdaptiveScheduler
and DefaultScheduler should have similar exception-related logic, right?
{quote}
That's correct. It was a typo in my comment. I meant that "concurrent
exceptions should NOT happen when using the AdaptiveScheduler". I corrected it
in my comment above.
> The concurrentExceptions doesn't work
> -------------------------------------
>
> Key: FLINK-33565
> URL: https://issues.apache.org/jira/browse/FLINK-33565
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.18.0, 1.17.1
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
>
> First of all, thanks to [~mapohl] for helping double-check in advance that
> this was indeed a bug.
> Displaying exception history in WebUI is supported in FLINK-6042.
> h1. What are concurrentExceptions?
> When an execution fails due to an exception, the other executions in the same
> region are restarted as well, and the first exception is the rootException.
> If other restarted executions also report exceptions at this time, we want to
> collect these exceptions and display them to the user as
> concurrentExceptions.
> h2. What's this bug?
> The concurrentExceptions list is always empty in production, even if other
> executions report exceptions at nearly the same time.
> h1. Why doesn't it work?
> If a job uses all-to-all shuffle, it has only one region, and this region
> contains many executions. If one execution throws an exception:
> * The JobMaster marks this execution as FAILED.
> * The remaining executions of this region are marked as CANCELING.
> ** The call stack can be found in FLIP-364
> [part-4.2.3|https://cwiki.apache.org/confluence/display/FLINK/FLIP-364%3A+Improve+the+restart-strategy#FLIP364:Improvetherestartstrategy-4.2.3Detailedcodeforfull-failover]
>
> When these executions throw exceptions as well, the JobMaster transitions
> their state from CANCELING to CANCELED instead of FAILED.
> A CANCELED execution does not trigger the FAILED logic, so its exception is
> ignored.
> Note: all reports are handled in the JobMaster RPC thread, which is
> single-threaded, so they are processed serially. As a result, only one
> execution is marked as FAILED, and the remaining executions are marked as
> CANCELED later.
> h1. How to fix it?
> After an offline discussion with [~mapohl], we first need to discuss with
> the community whether we should keep concurrentExceptions at all.
> * If no, we can remove the related logic directly.
> * If yes, we can discuss how to fix it later.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)