[
https://issues.apache.org/jira/browse/FLINK-30505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652132#comment-17652132
]
Xintong Song commented on FLINK-30505:
--------------------------------------
bq. If the termination of the pod is caused by the TM being unreachable, the
first exception thrown is "TM unreachable", so this does not mislead users.
If we close the JM-TM connection whenever the RM-TM connection is closed, the
exception that transitions the execution state to FAILED would be the
termination of the pod.
I think this changes the protocol between JM / RM / TM, in a way that lets the
RM control the connection between JM and TM. It also adds an
O(numJM * numTM) overhead in the RM RPC main thread. Given these costs, the
benefit is unclear to me. I'm not convinced that a pod termination exception is
more suitable than a TM unreachable exception for the "Root Exception" on the UI.
So I'm overall -1 to this proposal.
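To make the O(numJM * numTM) concern concrete, here is a rough sketch of one way to read it. This is not Flink code: the class and method names are hypothetical, and it assumes the RM has no cheaper way to find the affected JobMasters and so tells every registered JM to disconnect from each failed TM.

```java
/**
 * Hypothetical model, not Flink code: counts the "disconnect from TM"
 * messages the RM would have to send from its RPC main thread.
 */
public class RmFanOutSketch {

    /** Messages sent when a single TM fails: one per registered JM (assumption). */
    static long notificationsForOneFailure(int numJobMasters) {
        long sent = 0;
        for (int jm = 0; jm < numJobMasters; jm++) {
            sent++; // one "disconnect from TM" message to this JobMaster
        }
        return sent;
    }

    /** Worst case: every TM fails, so the per-failure fan-out repeats numTM times. */
    static long notificationsIfAllFail(int numJobMasters, int numTaskManagers) {
        long total = 0;
        for (int tm = 0; tm < numTaskManagers; tm++) {
            total += notificationsForOneFailure(numJobMasters);
        }
        return total;
    }

    public static void main(String[] args) {
        // 10 JMs and 200 TMs give 10 * 200 messages in the worst case.
        System.out.println(notificationsIfAllFail(10, 200)); // prints 2000
    }
}
```

Under these assumptions the cost grows with the product of both counts, which is the kind of load the RM main thread does not carry today.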
> Close the connection between TM and JM when task executor failed
> ----------------------------------------------------------------
>
> Key: FLINK-30505
> URL: https://issues.apache.org/jira/browse/FLINK-30505
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Task
> Affects Versions: 1.16.0
> Reporter: Yongming Zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.17.0
>
>
> When the resource manager detects that a task executor has failed, it closes
> its connection with the task executor. At that point, jobs running on this TM
> fail for other reasons (no longer reachable or heartbeat timeout).
> !https://intranetproxy.alipay.com/skylark/lark/0/2022/png/336411/1672047809511-a4b8b5d9-f11f-483c-a113-b42290a33250.png|width=1160,id=uc24b1166!
> If the connection between the task executor and the job master is closed when
> the resource manager detects that a task executor has failed, the real reason
> for the task executor failure will appear in "Root Exception". This will make
> it easier for users to find problems.
> !https://intranetproxy.alipay.com/skylark/lark/0/2022/png/336411/1672048733572-2b5b7be4-087d-46ae-9c8d-6ad5a1344019.png|width=1141,id=u947d8c4e!