[
https://issues.apache.org/jira/browse/FLINK-17726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279743#comment-17279743
]
Piotr Nowojski edited comment on FLINK-17726 at 2/5/21, 3:13 PM:
-----------------------------------------------------------------
We have discovered this issue once we introduced a bug, that caused Task switch
to `CANCELLED` state incorrectly (it should switched to `FINISHED`) and
deadlocked the job.
This led me to some discussion with [~trohrmann] about how to handle this kind
of issues. For me the most important part would be to not have a regression in
the way, how we are reporting the root/primary/real cause of the failure.
Currently switching from `RUNNING` -> `CANCELLED` state is a valid thing to do
for the task, if this is a "secondary" failure caused by upstream/downstream
task issue. This currently allows JobManager to easily ignore those "secondary"
failures, from the real failures and pick first reported "real" failure as the
root cause of the job/region failure.
If we followed the proposed here in the task solution, to not allow the
`RUNNING` -> `CANCELLED` transition, but just simply treat it as a regular
"primary" failure, I would expect user to be flooded with hundreds of secondary
failures. In that case it would be extremely difficult for him to figure out
what has happened. Primary example: "real" failure is a loss of Task Manager
that was either not detected, or would be detected after heartbeat timeout,
which caused hundreds/thousands "secondary" failures (currently `RUNNING` ->
`CANCELLED` transitions).
My proposal how to deal with this situation, would be to keep the distinction
of the "secondary" failure, but also enrich it with the information which task
was the reason behind. JobManager would receive information "Task B1 failed
because something has happened to with the Task A1".
That would let us do two things:
* If Job Manager managed to detect some primary failure, it could ignore (or
batch together) all of the secondary failures
* If no primary failure was detected, and we want to failover the job without
waiting for example for the heartbeat pointing to the primary failure, Job
Manager could connect secondary failures and the DAG, to deduce that something
bad has happened with the "Task B1"
was (Author: pnowojski):
We have discovered this issue once we introduced a bug, that caused Task switch
to `CANCELLED` state incorrectly (it should switched to `FINISHED`) and
deadlocked the job.
This led me to have some discussion with [~trohrmann] about a good way how to
handle this kind of issues. For me the most important part would be to not have
a regression in the way, how we are reporting the root/primary/real cause of
the failure. Currently switching from `RUNNING` -> `CANCELLED` state is a valid
thing to do for the task, if this is a "secondary" failure caused by
upstream/downstream task issue. This currently allows JobManager to easily
ignore those "secondary" failures, from the real failures and pick first
reported "real" failure as the root cause of the job/region failure.
If we followed the proposed here in the task solution, to not allow the
`RUNNING` -> `CANCELLED` transition, but just simply treat it as a regular
"primary" failure, I would expect user to be flooded with hundreds of secondary
failures, which would be extremely difficult for him to figure out what has
happened. Primary example: "real" failure is a loss of Task Manager that was
either not detected, or would be detected after heartbeat timeout, which caused
hundreds/thousands "secondary" failures (currently `RUNNING` -> `CANCELLED`
transitions).
My proposal how to deal with this situation, would be to keep the distinction
of the "secondary" failure, but also enrich it with the information which task
was the reason behind. JobManager would receive information "Task B1 failed
because something has happened to with the Task A1".
That would let us do two things:
* If JobManager managed to detect some primary failure, it could ignore (or
batch together) all of the secondary failures
* If no primary failure was detected, and we want to failover the job without
waiting for example for the heartbeat pointing to the primary failure, Job
Manager could connect secondary failures and the DAG, to deduce that something
bad has happened with the "Task B1"
> Scheduler should take care of tasks directly canceled by TaskManager
> --------------------------------------------------------------------
>
> Key: FLINK-17726
> URL: https://issues.apache.org/jira/browse/FLINK-17726
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Zhu Zhu
> Priority: Critical
>
> JobManager will not trigger failure handling when receiving CANCELED task
> update.
> This is because CANCELED tasks are usually caused by another FAILED task.
> These CANCELED tasks will be restarted by the failover process triggered
> FAILED task.
> However, if a task is directly CANCELED by TaskManager due to its own runtime
> issue, the task will not be recovered by JM and thus the job would hang.
> This is a potential issue and we should avoid it.
> A possible solution is to let JobManager treat tasks transitioning to
> CANCELED from all states except from CANCELING as failed tasks.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)