[jira] [Commented] (FLINK-33565) The concurrentExceptions doesn't work

Rui Fan (Jira) Sun, 07 Jan 2024 18:38:52 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-33565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804091#comment-17804091
 ]


Rui Fan commented on FLINK-33565:
---------------------------------

Hey [~mapohl], I have submitted the PR[1] for this JIRA. I haven't finished the 
detailed tests yet because I want to check with you first whether the solution 
is fine.

Would you mind taking a look at this PR in your free time? It would be better 
to finish it in 1.19, thanks~ :)

 

[1]https://github.com/apache/flink/pull/24003

> The concurrentExceptions doesn't work
> -------------------------------------
>
>                 Key: FLINK-33565
>                 URL: https://issues.apache.org/jira/browse/FLINK-33565
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.18.0, 1.17.1
>            Reporter: Rui Fan
>            Assignee: Rui Fan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.19.0
>
>         Attachments: screenshot-1.png
>
>
> First of all, thanks to [~mapohl] for helping double-check in advance that 
> this is indeed a bug.
> Displaying the exception history in the WebUI has been supported since 
> FLINK-6042.
> h1. What are concurrentExceptions?
> When an execution fails due to an exception, the other executions in the same 
> region are restarted as well, and the first exception is the rootException. If 
> other restarted executions also report exceptions at this time, we want to 
> collect these exceptions and display them to the user as concurrentExceptions.
> h2. What's this bug?
> concurrentExceptions is always empty in production, even if other 
> executions report exceptions at nearly the same time.
> h1. Why doesn't it work?
> If a job has an all-to-all shuffle, the job has only one region, and this 
> region has a lot of executions. If one execution throws an exception:
>  * JobMaster will mark the state of this execution as FAILED.
>  * The rest of the executions in this region will be marked as CANCELING.
>  ** This call stack can be found at FLIP-364 
> [part-4.2.3|https://cwiki.apache.org/confluence/display/FLINK/FLIP-364%3A+Improve+the+restart-strategy#FLIP364:Improvetherestartstrategy-4.2.3Detailedcodeforfull-failover]
>  
> When these executions throw exceptions as well, JobMaster will change their 
> state from CANCELING to CANCELED instead of FAILED.
> A CANCELED execution doesn't go through the FAILED logic, so its exception is 
> ignored.
> Note: all reports are handled inside the JobMaster RPC thread, which is a 
> single thread, so the reports are processed serially. Therefore only one 
> execution is marked as FAILED, and the rest of the executions are marked as 
> CANCELED later.
> h1. How to fix it?
> After an offline discussion with [~mapohl], we first need to discuss with the 
> community whether we should keep concurrentExceptions.
>  * If no, we can remove the related logic directly.
>  * If yes, we discuss how to fix it later.
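
The sequence described under "Why doesn't it work?" can be illustrated with a 
small standalone sketch. This is not Flink's actual code: the class, the Region 
and Execution types and the reportFailure method are hypothetical names chosen 
for illustration, and the model only assumes what the description states 
(serial handling of reports, FAILED for the first execution, CANCELING then 
CANCELED for the rest).

```java
import java.util.ArrayList;
import java.util.List;

public class ConcurrentExceptionsSketch {

    public enum State { RUNNING, CANCELING, CANCELED, FAILED }

    /** Hypothetical stand-in for one execution of a region. */
    public static final class Execution {
        public final String name;
        public State state = State.RUNNING;

        Execution(String name) {
            this.name = name;
        }
    }

    /** Hypothetical stand-in for the bookkeeping of a single region. */
    public static final class Region {
        public final List<Execution> executions = new ArrayList<>();
        public String rootException; // first failure that reaches the FAILED path
        public final List<String> concurrentExceptions = new ArrayList<>();

        public Execution add(String name) {
            Execution execution = new Execution(name);
            executions.add(execution);
            return execution;
        }

        /** Handles one failure report; reports are processed one after another. */
        public void reportFailure(Execution reporter, String exception) {
            if (reporter.state == State.CANCELING) {
                // Already being cancelled: it becomes CANCELED, the FAILED
                // path below is skipped, and the exception is dropped.
                reporter.state = State.CANCELED;
                return;
            }
            if (reporter.state != State.RUNNING) {
                return; // already in a terminal state
            }
            reporter.state = State.FAILED;
            if (rootException == null) {
                rootException = exception;
            } else {
                concurrentExceptions.add(exception);
            }
            // Full failover: every other running execution is cancelled.
            for (Execution other : executions) {
                if (other != reporter && other.state == State.RUNNING) {
                    other.state = State.CANCELING;
                }
            }
        }
    }

    public static void main(String[] args) {
        Region region = new Region();
        Execution a = region.add("A");
        Execution b = region.add("B");
        Execution c = region.add("C");

        region.reportFailure(a, "exception from A");
        region.reportFailure(b, "exception from B"); // B is CANCELING by now
        region.reportFailure(c, "exception from C"); // C is CANCELING by now

        // prints: root=exception from A, concurrent=[]
        System.out.println("root=" + region.rootException
                + ", concurrent=" + region.concurrentExceptions);
    }
}
```

In this model the second and third reports always find their execution in 
CANCELING, so the branch that would record a concurrent exception is never 
reached. That matches the observation that concurrentExceptions stays empty.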



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
