[
https://issues.apache.org/jira/browse/FLINK-21707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Zhu Zhu updated FLINK-21707:
----------------------------
Description:
Job is possible to hang when restarting a FINISHED task with POINTWISE BLOCKING
consumers. This is because
{{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} will try to
schedule all the consumer tasks/regions of the finished *ExecutionJobVertex*,
even though the regions are not the exact consumers of the finished
*ExecutionVertex*. In this case, some of the regions can be in non-CREATED
state because they are not connected to nor affected by the restarted tasks.
However, {{PipelinedRegionSchedulingStrategy#maybeScheduleRegion()}} does not
allow to schedule a non-CREATED region and will throw an Exception and breaks
the scheduling of all the other regions. One example to show this problem case
can be found at
[PipelinedRegionSchedulingITCase#testRecoverFromPartitionException
|https://github.com/zhuzhurk/flink/commit/1eb036b6566c5cb4958d9957ba84dc78ce62a08c].
To fix the problem, we can add a filter in
{{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} to only trigger
the scheduling of regions in CREATED state.
was:
Job is possible to hang when restarting a FINISHED task with POINTWISE BLOCKING
consumers. This is because
{{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} will try to
schedule all the consumer tasks/regions of the finished *ExecutionJobVertex*,
even though the regions are not the exact consumers of the finished
*ExecutionVertex*. In this case, some of the regions can be in state other than
CREATED because they are not connected to and affected by the restarted tasks.
However, {{PipelinedRegionSchedulingStrategy#maybeScheduleRegion()}} does not
allow to schedule a non-CREATED region and will throw an Exception and breaks
the scheduling of all the other regions. One example to show this problem case
can be found at
[PipelinedRegionSchedulingITCase#testRecoverFromPartitionException
|https://github.com/zhuzhurk/flink/commit/1eb036b6566c5cb4958d9957ba84dc78ce62a08c].
To fix the problem, we can add a filter in
{{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} to only trigger
the scheduling of regions in CREATED state.
> Job is possible to hang when restarting a FINISHED task with POINTWISE
> BLOCKING consumers
> -----------------------------------------------------------------------------------------
>
> Key: FLINK-21707
> URL: https://issues.apache.org/jira/browse/FLINK-21707
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.11.3, 1.12.2, 1.13.0
> Reporter: Zhu Zhu
> Priority: Blocker
>
> Job is possible to hang when restarting a FINISHED task with POINTWISE
> BLOCKING consumers. This is because
> {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} will try to
> schedule all the consumer tasks/regions of the finished *ExecutionJobVertex*,
> even though the regions are not the exact consumers of the finished
> *ExecutionVertex*. In this case, some of the regions can be in non-CREATED
> state because they are not connected to nor affected by the restarted tasks.
> However, {{PipelinedRegionSchedulingStrategy#maybeScheduleRegion()}} does not
> allow to schedule a non-CREATED region and will throw an Exception and breaks
> the scheduling of all the other regions. One example to show this problem
> case can be found at
> [PipelinedRegionSchedulingITCase#testRecoverFromPartitionException
> |https://github.com/zhuzhurk/flink/commit/1eb036b6566c5cb4958d9957ba84dc78ce62a08c].
> To fix the problem, we can add a filter in
> {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} to only
> trigger the scheduling of regions in CREATED state.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)