[
https://issues.apache.org/jira/browse/FLINK-33565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788039#comment-17788039
]
Matthias Pohl commented on FLINK-33565:
---------------------------------------
Thanks for raising the issue, [~fanrui] . I had a brief discussion with
[~chesnay] on that topic:
There's a difference between the {{Default-}} and the
{{AdaptiveScheduler}}. The latter one doesn't support pipelined regions.
The {{DefaultScheduler}} does support them. Therefore, concurrent exceptions
shouldn't happen when using the {{AdaptiveScheduler}}. But there was an issue
in the past, still unexplained, where concurrent exceptions caused a problem
in a run that had the {{AdaptiveScheduler}} enabled (see FLINK-33121). They
looked into it but have so far struggled to find the cause.
On the other hand, {{DefaultScheduler}} comes with pipelined region support.
The scenario that they have considered when thinking about concurrent
exceptions was that you can have two pipelined regions being executed
concurrently. They are both failing independently with one of the two errors
becoming the root cause for the job's failure. The
{{PipelinedRegionSchedulingStrategy}} is in charge of scheduling vertex
restarts. Apparently, it would be possible to put the vertices of two different
pipelines together to reduce the number of restarts.
I looked into the code of {{PipelinedRegionSchedulingStrategy#restartTasks}}.
I struggled to find the merge behavior, though. Based on my findings, the
{{PipelinedRegionSchedulingStrategy}} does indeed merge pipelined regions
together, but only based on the vertices that are already selected for a
restart. Because of this, I'm not sure whether the conclusion Chesnay and I
came up with in the first place is correct. I wanted to share it anyway. I'm
wondering whether you can find a mistake in our reasoning.
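
To make the "merge only among already selected vertices" observation concrete, here is a minimal sketch. It is not Flink's code: the class {{RegionRestartSketch}}, the method {{regionsToRestart}} and the string-based vertex and region ids are all made up for illustration, and the real strategy may differ in detail.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a region-based restart step; not Flink's real class. */
final class RegionRestartSketch {

    /**
     * Maps each vertex selected for restart to its pipelined region and
     * deduplicates the result. A region without a selected vertex is never added.
     */
    static Set<String> regionsToRestart(
            Set<String> verticesToRestart, Map<String, String> regionOfVertex) {
        Set<String> regions = new LinkedHashSet<>();
        for (String vertex : verticesToRestart) {
            String region = regionOfVertex.get(vertex);
            if (region == null) {
                throw new IllegalArgumentException("Unknown vertex: " + vertex);
            }
            regions.add(region);
        }
        return regions;
    }

    public static void main(String[] args) {
        // Made-up topology: two regions with two vertices each.
        Map<String, String> regionOfVertex =
                Map.of("v1", "regionA", "v2", "regionA", "v3", "regionB", "v4", "regionB");

        // Only vertices of regionA were selected, so regionB is not pulled in.
        System.out.println(regionsToRestart(Set.of("v1", "v2"), regionOfVertex));

        // Vertices of both regions were selected, so both regions are handled together.
        System.out.println(regionsToRestart(Set.of("v1", "v3"), regionOfVertex));
    }
}
```

In this model, two regions end up in the same restart only if the caller already selected vertices from both, which matches what I saw in the code.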
> The concurrentExceptions doesn't work
> -------------------------------------
>
> Key: FLINK-33565
> URL: https://issues.apache.org/jira/browse/FLINK-33565
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.18.0, 1.17.1
> Reporter: Rui Fan
> Assignee: Rui Fan
> Priority: Major
>
> First of all, thanks to [~mapohl] for helping double-check in advance that
> this was indeed a bug.
> Displaying the exception history in the WebUI was added in FLINK-6042.
> h1. What's the concurrentExceptions?
> When an execution fails due to an exception, other executions in the same
> region will also restart, and the first exception is the rootException. If
> other restarted executions also report exceptions at this time, we hope to
> collect these exceptions and display them to the user as concurrentExceptions.
> h2. What's this bug?
> The concurrentExceptions is always empty in production, even if other
> executions report exceptions at nearly the same time.
> h1. Why doesn't it work?
> If a job has an all-to-all shuffle, it has only one region, and this
> region has a lot of executions. If one execution throws an exception:
> * The JobMaster will mark this execution's state as FAILED.
> * The rest of the executions of this region will be marked as CANCELING.
> ** This call stack can be found at FLIP-364
> [part-4.2.3|https://cwiki.apache.org/confluence/display/FLINK/FLIP-364%3A+Improve+the+restart-strategy#FLIP364:Improvetherestartstrategy-4.2.3Detailedcodeforfull-failover]
>
> When these executions throw exceptions as well, the JobMaster will move their
> state from CANCELING to CANCELED instead of FAILED.
> A CANCELED execution doesn't run the FAILED logic, so its exception is
> ignored.
> Note: all reports are processed in the JobMaster RPC thread, which is a single
> thread, so they are executed serially. As a result, only one execution is
> marked as FAILED, and the rest of the executions are marked as CANCELED later.
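
The sequence described above can be modeled with a small sketch. This is a simplified, hypothetical model and not Flink's actual classes: {{ConcurrentExceptionsSketch}}, its {{State}} enum and {{reportFailure}} are invented names, and the single-threaded RPC handling is represented simply by calling the method sequentially.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of the reported behavior; not Flink's actual classes. */
final class ConcurrentExceptionsSketch {

    enum State { RUNNING, CANCELING, FAILED, CANCELED }

    private final Map<String, State> executions = new LinkedHashMap<>();
    String rootException; // first failure that was recorded
    final List<String> concurrentExceptions = new ArrayList<>();

    ConcurrentExceptionsSketch(List<String> executionIds) {
        for (String id : executionIds) {
            executions.put(id, State.RUNNING);
        }
    }

    State state(String executionId) {
        return executions.get(executionId);
    }

    /** Reports are handled one at a time, like calls on a single RPC thread. */
    void reportFailure(String executionId, String exception) {
        State current = executions.get(executionId);
        if (current == State.RUNNING) {
            // Only this path records the exception.
            executions.put(executionId, State.FAILED);
            if (rootException == null) {
                rootException = exception;
            } else {
                concurrentExceptions.add(exception);
            }
            // Single region: every other running execution is told to cancel.
            executions.replaceAll((id, s) -> s == State.RUNNING ? State.CANCELING : s);
        } else if (current == State.CANCELING) {
            // The failure only completes the cancellation; the exception is dropped.
            executions.put(executionId, State.CANCELED);
        }
    }

    public static void main(String[] args) {
        ConcurrentExceptionsSketch job =
                new ConcurrentExceptionsSketch(List.of("e1", "e2", "e3"));
        job.reportFailure("e1", "boom-1");
        job.reportFailure("e2", "boom-2"); // arrives right after, but e2 is already CANCELING
        job.reportFailure("e3", "boom-3");
        System.out.println("root=" + job.rootException);               // prints root=boom-1
        System.out.println("concurrent=" + job.concurrentExceptions);  // prints concurrent=[]
    }
}
```

With a single region, no execution is still RUNNING after the first report, so in this model the branch that would add to the concurrent exceptions is never reached.
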
> h1. How to fix it?
> After an offline discussion with [~mapohl], we first need to discuss with the
> community whether we should keep the concurrentExceptions.
> * If not, we can remove the related logic directly.
> * If yes, we discuss how to fix it later.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)