[ 
https://issues.apache.org/jira/browse/FLINK-30505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652111#comment-17652111
 ] 

Yongming Zhang commented on FLINK-30505:
----------------------------------------

If the pod is terminated because the TM is unreachable, the first exception 
thrown is "TM unreachable", so users are not misled. On the other hand, if the 
connection between the task executor and the job master is closed as soon as 
the resource manager detects that a task executor has failed, jobs can fail 
over earlier.
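
To illustrate the ordering argument, here is a minimal toy model in Java. It is 
not Flink code: the class names, method names, and failure messages are all 
hypothetical, and the only assumption it encodes is that the job master keeps 
the first failure cause it receives as the root exception.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only. Every name and message here is hypothetical, not a Flink API. */
final class FailoverSketch {

    /** Stand-in for a job master that keeps the first failure cause it sees. */
    static final class JobMaster {
        private String rootException; // the first reported cause wins

        void disconnectTaskExecutor(String tmId, String cause) {
            if (rootException == null) {
                rootException = cause;
            }
        }

        String rootException() {
            return rootException;
        }
    }

    /** Stand-in for a resource manager that may or may not notify job masters. */
    static final class ResourceManager {
        private final List<JobMaster> jobMasters = new ArrayList<>();
        private final boolean notifyJobMasters;

        ResourceManager(boolean notifyJobMasters) {
            this.notifyJobMasters = notifyJobMasters;
        }

        void register(JobMaster jm) {
            jobMasters.add(jm);
        }

        void onTaskExecutorFailed(String tmId, String realCause) {
            // Closing the resource manager's own connection is not modeled here.
            if (notifyJobMasters) {
                // Proposed behavior: pass the real cause on to every job master.
                for (JobMaster jm : jobMasters) {
                    jm.disconnectTaskExecutor(tmId, realCause);
                }
            }
        }
    }

    /** Runs one failure scenario and returns what ends up as the root exception. */
    static String simulate(boolean notifyJobMasters) {
        JobMaster jm = new JobMaster();
        ResourceManager rm = new ResourceManager(notifyJobMasters);
        rm.register(jm);

        // Step 1: the resource manager detects the failure first.
        rm.onTaskExecutorFailed("tm-1", "pod terminated (illustrative cause)");
        // Step 2: later, the job master notices on its own that the TM is gone.
        jm.disconnectTaskExecutor("tm-1", "heartbeat timeout");

        return jm.rootException();
    }

    public static void main(String[] args) {
        System.out.println("with notification:    " + simulate(true));
        System.out.println("without notification: " + simulate(false));
    }
}
```

With notification enabled, the job master records the cause seen by the 
resource manager and can react at step 1. Without it, the job master only 
learns of the failure at step 2 and records the secondary symptom.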

> Close the connection between TM and JM when task executor failed
> ----------------------------------------------------------------
>
>                 Key: FLINK-30505
>                 URL: https://issues.apache.org/jira/browse/FLINK-30505
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>    Affects Versions: 1.16.0
>            Reporter: Yongming Zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.17.0
>
>
> When the resource manager detects that a task executor has failed, it closes 
> its connection with that task executor. At that point, jobs running on this TM 
> fail for other reasons (no longer reachable or heartbeat timeout).
> !https://intranetproxy.alipay.com/skylark/lark/0/2022/png/336411/1672047809511-a4b8b5d9-f11f-483c-a113-b42290a33250.png|width=1160,id=uc24b1166!
> If the connection between the task executor and the job master is also closed 
> when the resource manager detects that a task executor has failed, the real 
> reason for the task executor failure will appear in "Root Exception". This 
> will make it easier for users to find problems.
> !https://intranetproxy.alipay.com/skylark/lark/0/2022/png/336411/1672048733572-2b5b7be4-087d-46ae-9c8d-6ad5a1344019.png|width=1141,id=u947d8c4e!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
