[jira] [Updated] (FLINK-21707) Job is possible to hang when restarting a FINISHED task with POINTWISE BLOCKING consumers

Zhu Zhu (Jira) Tue, 09 Mar 2021 23:58:15 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-21707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zhu Zhu updated FLINK-21707:
----------------------------
    Description: 
Job is possible to hang when restarting a FINISHED task with POINTWISE BLOCKING 
consumers. This is because 
{{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} will try to 
schedule all the consumer tasks/regions of the finished *ExecutionJobVertex*, 
even though the regions are not the exact consumers of the finished 
*ExecutionVertex*. In this case, some of the regions can be in non-CREATED 
state because they are not connected to nor affected by the restarted tasks. 
However, {{PipelinedRegionSchedulingStrategy#maybeScheduleRegion()}} does not 
allow to schedule a non-CREATED region and will throw an Exception and breaks 
the scheduling of all the other regions. One example to show this problem case 
can be found at 
[PipelinedRegionSchedulingITCase#testRecoverFromPartitionException 
|https://github.com/zhuzhurk/flink/commit/1eb036b6566c5cb4958d9957ba84dc78ce62a08c].

To fix the problem, we can add a filter in 
{{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} to only trigger 
the scheduling of regions in CREATED state.

  was:
Job is possible to hang when restarting a FINISHED task with POINTWISE BLOCKING 
consumers. This is because 
{{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} will try to 
schedule all the consumer tasks/regions of the finished *ExecutionJobVertex*, 
even though the regions are not the exact consumers of the finished 
*ExecutionVertex*. In this case, some of the regions can be in state other than 
CREATED because they are not connected to and affected by the restarted tasks. 
However, {{PipelinedRegionSchedulingStrategy#maybeScheduleRegion()}} does not 
allow to schedule a non-CREATED region and will throw an Exception and breaks 
the scheduling of all the other regions. One example to show this problem case 
can be found at 
[PipelinedRegionSchedulingITCase#testRecoverFromPartitionException 
|https://github.com/zhuzhurk/flink/commit/1eb036b6566c5cb4958d9957ba84dc78ce62a08c].

To fix the problem, we can add a filter in 
{{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} to only trigger 
the scheduling of regions in CREATED state.


> Job is possible to hang when restarting a FINISHED task with POINTWISE 
> BLOCKING consumers
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-21707
>                 URL: https://issues.apache.org/jira/browse/FLINK-21707
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.3, 1.12.2, 1.13.0
>            Reporter: Zhu Zhu
>            Priority: Blocker
>
> Job is possible to hang when restarting a FINISHED task with POINTWISE 
> BLOCKING consumers. This is because 
> {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} will try to 
> schedule all the consumer tasks/regions of the finished *ExecutionJobVertex*, 
> even though the regions are not the exact consumers of the finished 
> *ExecutionVertex*. In this case, some of the regions can be in non-CREATED 
> state because they are not connected to nor affected by the restarted tasks. 
> However, {{PipelinedRegionSchedulingStrategy#maybeScheduleRegion()}} does not 
> allow to schedule a non-CREATED region and will throw an Exception and breaks 
> the scheduling of all the other regions. One example to show this problem 
> case can be found at 
> [PipelinedRegionSchedulingITCase#testRecoverFromPartitionException 
> |https://github.com/zhuzhurk/flink/commit/1eb036b6566c5cb4958d9957ba84dc78ce62a08c].
> To fix the problem, we can add a filter in 
> {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} to only 
> trigger the scheduling of regions in CREATED state.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (FLINK-21707) Job is possible to hang when restarting a FINISHED task with POINTWISE BLOCKING consumers

Reply via email to