[jira] [Commented] (FLINK-33565) The concurrentExceptions doesn't work

Matthias Pohl (Jira) Wed, 22 Nov 2023 01:58:04 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-33565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788697#comment-17788697
 ]


Matthias Pohl commented on FLINK-33565:
---------------------------------------

{quote}
I don't understand why concurrent exceptions should happen when using the 
AdaptiveScheduler. When one job only has all-to-all shuffle, AdaptiveScheduler 
and DefaultScheduler should have similar exception-related logic, right?
{quote}
That's correct. It was a typo in my comment. I meant that "concurrent 
exceptions should NOT happen when using the AdaptiveScheduler". I corrected it 
in my comment above.

> The concurrentExceptions doesn't work
> -------------------------------------
>
>                 Key: FLINK-33565
>                 URL: https://issues.apache.org/jira/browse/FLINK-33565
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.0, 1.17.1
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>
> First of all, thanks to [~mapohl] for helping double-check in advance that 
> this was indeed a bug .
> Displaying exception history in WebUI is supported in FLINK-6042.
> h1. What's the concurrentExceptions?
> When an execution fails due to an exception, other executions in the same 
> region will also restart, and the first Exception is rootException. If other 
> restarted executions also report Exception at this time, we hope to collect 
> these exceptions and Displayed to the user as concurrentExceptions.
> h2. What's this bug?
> The concurrentExceptions is always empty in production, even if other 
> executions report exception at very close times.
> h1. Why doesn't it work?
> If one job has all-to-all shuffle, this job only has one region, and this 
> region has a lot of executions. If one execution throw exception:
>  * JobMaster will mark the state as FAILED for this execution.
>  * The rest of executions of this region will be marked to CANCELING.
>  ** This call stack can be found at FLIP-364 
> [part-4.2.3|https://cwiki.apache.org/confluence/display/FLINK/FLIP-364%3A+Improve+the+restart-strategy#FLIP364:Improvetherestartstrategy-4.2.3Detailedcodeforfull-failover]
>  
> When these executions throw exception as well, it JobMaster will mark the 
> state from CANCELING to CANCELED instead of FAILED.
> The CANCELED execution won't call FAILED logic, so their exceptions are 
> ignored.
> Note: all reports are executed inside of JobMaster RPC thread, it's single 
> thread. So these reports are executed serially. So only one execution is 
> marked to FAILED, and the rest of executions will be marked to CANCELED later.
> h1. How to fix it?
> Offline discuss with [~mapohl] , we need to discuss with community should we 
> keep the concurrentExceptions first.
>  * If no, we can remove related logic directly
>  * If yew, we discuss how to fix it later.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (FLINK-33565) The concurrentExceptions doesn't work

Reply via email to