[
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17280409#comment-17280409
]
Zhu Zhu edited comment on FLINK-17726 at 2/7/21, 8:23 AM:
----------------------------------------------------------
Thanks for relaunching this discussion and proposing a solution! [~pnowojski]
I'd like to double-check the proposal. Please correct me if I have
misunderstood it:
1. A task should not become CANCELED in the TM unless it was CANCELING.
Otherwise it should transition to FAILED with a "secondary" failure that
carries information about the root cause task.
2. The JM triggers failovers on "primary" failures and ignores the related
"secondary" failures. For "secondary" failures, given that the related
"primary" failure should always be reported sooner or later, the JM can simply
mark the task as CANCELED and skip the failure handling. As a further
improvement, the JM can register a timeout on secondary failures, either to
cover the case where the related "primary" failure is never reported, or to
speed up recovery without waiting for a heartbeat timeout.
3. The JM triggers a failover if a task transitions directly from
DEPLOYING/RUNNING to CANCELED in the TM, although this should never happen
once #1 is done.
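To make point 2 concrete, here is a minimal sketch of how the JM-side bookkeeping for "primary" and "secondary" failures might look. Everything in it is an assumption for illustration: `SecondaryFailureTracker`, `Action`, the string task ids and the explicit `nowMillis` clock are hypothetical and are not Flink's actual scheduler API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Flink code: tracks secondary failures until the
// related primary failure arrives or a timeout expires.
class SecondaryFailureTracker {
    enum Action { TRIGGER_FAILOVER, MARK_CANCELED_AND_WAIT, NONE }

    private final long timeoutMillis;
    // Root-cause task id -> time the first secondary failure naming it arrived.
    private final Map<String, Long> pendingRootCauses = new HashMap<>();
    // Primary failures that have already been handled by a failover.
    private final Set<String> handledPrimaries = new HashSet<>();

    SecondaryFailureTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // A primary failure always triggers a failover and clears any pending wait.
    Action onPrimaryFailure(String taskId) {
        handledPrimaries.add(taskId);
        pendingRootCauses.remove(taskId);
        return Action.TRIGGER_FAILOVER;
    }

    // A secondary failure is only recorded; the task is marked CANCELED and the
    // JM waits for the primary failure unless it has been handled already.
    Action onSecondaryFailure(String taskId, String rootCauseTaskId, long nowMillis) {
        if (!handledPrimaries.contains(rootCauseTaskId)) {
            pendingRootCauses.putIfAbsent(rootCauseTaskId, nowMillis);
        }
        return Action.MARK_CANCELED_AND_WAIT;
    }

    // Called periodically: if a primary failure is still missing after the
    // timeout, fall back to triggering the failover from the secondary failure.
    Action onTimerTick(long nowMillis) {
        boolean expired = pendingRootCauses.entrySet()
                .removeIf(e -> nowMillis - e.getValue() >= timeoutMillis);
        return expired ? Action.TRIGGER_FAILOVER : Action.NONE;
    }
}
```

A real implementation would need to scope this state to execution attempts and reset it on each failover; the sketch leaves that out to keep the timeout idea visible.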
> Scheduler should take care of tasks directly canceled by TaskManager
> --------------------------------------------------------------------
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
>
> JobManager will not trigger failure handling when receiving a CANCELED task
> update.
> This is because CANCELED tasks are usually caused by another FAILED task.
> These CANCELED tasks will be restarted by the failover process triggered by
> the FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime
> issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to
> CANCELED from any state other than CANCELING as failed tasks.
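The solution proposed in the description boils down to a check on the state transition reported by the TaskManager. A minimal sketch, assuming a simplified `State` enum and a hypothetical helper `shouldTriggerFailover` (neither is Flink's actual API):

```java
// Hypothetical sketch, not Flink code.
class CanceledTransitionCheck {
    // Simplified stand-in for the task execution states.
    enum State { CREATED, SCHEDULED, DEPLOYING, RUNNING, CANCELING, CANCELED, FAILED, FINISHED }

    // Returns true if a task state update should go through failure handling.
    static boolean shouldTriggerFailover(State previous, State reported) {
        if (reported == State.FAILED) {
            return true;
        }
        // CANCELED is only expected after a cancellation was requested
        // (CANCELING); from any other state it is treated as a failure.
        return reported == State.CANCELED && previous != State.CANCELING;
    }

    public static void main(String[] args) {
        System.out.println(shouldTriggerFailover(State.RUNNING, State.CANCELED));   // true
        System.out.println(shouldTriggerFailover(State.CANCELING, State.CANCELED)); // false
    }
}
```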