[ 
https://issues.apache.org/jira/browse/FLINK-30505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652097#comment-17652097
 ] 

Xintong Song commented on FLINK-30505:
--------------------------------------

On the contrary, I think this can sometimes be misleading. When a TM becomes 
unreachable, the RM will try to remove the pod/container from K8s/Yarn anyway. That 
means the TM may still be alive while unreachable (stuck, network 
problems, etc.), and will then be terminated by K8s. Thus the termination of 
the pod can be either the cause or the result of the TM being unreachable.
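
To make the ambiguity concrete, here is a minimal toy sketch. It is not Flink's ResourceManager code: the class `ToyCluster` and its methods are hypothetical names invented for illustration, and the model assumes only what is described above (the RM releases the pod once it considers the TM unreachable). Both scenarios end in the same observable state, so the terminated pod alone does not reveal which event came first.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Hypothetical toy model of one TM pod; not a Flink API."""
    events: list = field(default_factory=list)
    pod_alive: bool = True
    tm_reachable: bool = True

    def pod_dies(self):
        # Scenario A: the pod is terminated first (cause),
        # which makes the TM unreachable (result).
        self.pod_alive = False
        self.events.append("pod_terminated")
        self.tm_reachable = False
        self.events.append("tm_unreachable")
        self._rm_cleanup()

    def tm_hangs(self):
        # Scenario B: the TM is stuck or partitioned while the pod still runs.
        self.tm_reachable = False
        self.events.append("tm_unreachable")
        self._rm_cleanup()

    def _rm_cleanup(self):
        # Assumption from the comment above: the RM removes the pod
        # whenever the TM is considered unreachable, if it still exists.
        if self.pod_alive:
            self.pod_alive = False
            self.events.append("pod_terminated")

    def observed_state(self):
        # What someone inspecting the cluster afterwards would see.
        return (self.pod_alive, self.tm_reachable)


a = ToyCluster()
a.pod_dies()
b = ToyCluster()
b.tm_hangs()

# Same end state, opposite event order.
print(a.observed_state() == b.observed_state())
print(a.events)
print(b.events)
```

In this sketch the two runs are indistinguishable by their final state and differ only in event order, which is why reporting the pod termination as the root cause could point users in the wrong direction.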

> Close the connection between TM and JM when task executor failed
> ----------------------------------------------------------------
>
>                 Key: FLINK-30505
>                 URL: https://issues.apache.org/jira/browse/FLINK-30505
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Task
>    Affects Versions: 1.16.0
>            Reporter: Yongming Zhang
>            Priority: Major
>             Fix For: 1.17.0
>
>
> When the resource manager detects that a task executor has failed, it closes 
> its connection with the task executor. At this point, jobs running on this TM 
> fail for other reasons (no longer reachable or heartbeat timeout).
> !https://intranetproxy.alipay.com/skylark/lark/0/2022/png/336411/1672047809511-a4b8b5d9-f11f-483c-a113-b42290a33250.png|width=1160,id=uc24b1166!
> If the connection between the task executor and the job master is also closed 
> when the resource manager detects that a task executor has failed, the real 
> reason for the task executor failure will appear in "Root Exception". This 
> will make it easier for users to find problems.
> !https://intranetproxy.alipay.com/skylark/lark/0/2022/png/336411/1672048733572-2b5b7be4-087d-46ae-9c8d-6ad5a1344019.png|width=1141,id=u947d8c4e!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
