[ 
https://issues.apache.org/jira/browse/FLINK-14375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhu Zhu updated FLINK-14375:
----------------------------
    Description: 
The DefaultScheduler triggers failover if a task is notified to be FAILED. 
However, in the case the multiple tasks in the same region fail together, it 
will trigger multiple failovers. The later triggered failovers are useless, 
lead to concurrent failovers and will increase the restart attempts count.

I think the deep reason for this issue is that some fake state changes are 
notified to the DefaultScheduler.
The case above is a FAILED state change from TM will turn a CANCELING vertex to 
CANCELED, and the actual state transition is to CANCELED. But a FAILED state is 
notified to DefaultScheduler.

And there can be another possible issue caused by it, that a FINISHED state 
change is notified from TM when a vertex is CANCELING. The vertex will become 
CANCELED, while its FINISHED state change will be notified to DefaultScheduler 
which may trigger downstream task scheduling.

I'd propose to fix it like this. 
 - The DefaultScheduler does not handle the state update in 
SchedulerBase#updateTaskExecutionState.
 - It only handles state transitions that really happened in ExecutionGraph (on 
Execution#transitionState or vertex reset).

Besides avoiding fake state update, it also has a few other benefits:
1. we can get rid of the hack code in Execution#processFail
2. all state changes will be notified to SchedulingStrategy, i.e. it solves 
FLINK-14233

Here's the POC of this proposal 
https://github.com/zhuzhurk/flink/commits/refactor_state_update.




  was:
The DefaultScheduler triggers failover if a task is notified to be FAILED. 
However, in the case the multiple tasks in the same region fail together, it 
will trigger multiple failovers. The later triggered failovers are useless, 
lead to concurrent failovers and will increase the restart attempts count.

I think the deep reason for this issue is that some fake state changes are 
notified to the DefaultScheduler.
The case above is a FAILED state change from TM will turn a CANCELING vertex to 
CANCELED, and the actual state transition is to CANCELED. But a FAILED state is 
notified to DefaultScheduler.

And there can be another possible issue caused by it, that a FINISHED state 
change is notified from TM when a vertex is CANCELING. The vertex will become 
CANCELED, while its FINISHED state change will be notified to DefaultScheduler 
which may trigger downstream task scheduling.

I'd propose to fix it like this. 
 - The DefaultScheduler does not handle the state update in 
SchedulerBase#updateTaskExecutionState.
 - It only handles state transitions that really happened in ExecutionGraph (on 
Execution#transitionState or vertex reset).

Besides avoiding fake state update, it also has a few other benefits:
1. we can get rid of the hack code in Execution#processFail
2. all state changes will be notified to SchedulingStrategy, i.e. it solves 
FLINK-14233

Here's the POC of this proposal 
https://github.com/zhuzhurk/flink/commits/refactor_state_update.


> Refactor task state updating to only notify scheduler about state changes 
> that really happened
> ----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-14375
>                 URL: https://issues.apache.org/jira/browse/FLINK-14375
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.10.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> The DefaultScheduler triggers failover if a task is notified to be FAILED. 
> However, in the case the multiple tasks in the same region fail together, it 
> will trigger multiple failovers. The later triggered failovers are useless, 
> lead to concurrent failovers and will increase the restart attempts count.
> I think the deep reason for this issue is that some fake state changes are 
> notified to the DefaultScheduler.
> The case above is a FAILED state change from TM will turn a CANCELING vertex 
> to CANCELED, and the actual state transition is to CANCELED. But a FAILED 
> state is notified to DefaultScheduler.
> And there can be another possible issue caused by it, that a FINISHED state 
> change is notified from TM when a vertex is CANCELING. The vertex will become 
> CANCELED, while its FINISHED state change will be notified to 
> DefaultScheduler which may trigger downstream task scheduling.
> I'd propose to fix it like this. 
>  - The DefaultScheduler does not handle the state update in 
> SchedulerBase#updateTaskExecutionState.
>  - It only handles state transitions that really happened in ExecutionGraph 
> (on Execution#transitionState or vertex reset).
> Besides avoiding fake state update, it also has a few other benefits:
> 1. we can get rid of the hack code in Execution#processFail
> 2. all state changes will be notified to SchedulingStrategy, i.e. it solves 
> FLINK-14233
> Here's the POC of this proposal 
> https://github.com/zhuzhurk/flink/commits/refactor_state_update.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to