[jira] [Assigned] (FLINK-37232) FLIP-272 breaks some synchronization assumption on the AdaptiveScheduler's side

Matthias Pohl (Jira) Tue, 28 Jan 2025 22:43:08 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-37232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Matthias Pohl reassigned FLINK-37232:
-------------------------------------

    Assignee: Matthias Pohl

> FLIP-272 breaks some synchronization assumption on the AdaptiveScheduler's 
> side
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-37232
>                 URL: https://issues.apache.org/jira/browse/FLINK-37232
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 2.0.0, 2.0-preview
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Blocker
>
> We noticed some unexpected behavior with the AdaptiveScheduler causing a job 
> to reach FAILED state due to {{NoResourceAvailableException}}. The cause was 
> that some TaskManager shut down while the job was performing a rescaling 
> operation.
> [~chesnay] did a bit of digging and identified an issue with the state 
> transition short cut that was introduced in 
> [FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states]
>  (ignoring {{WaitingForResources}} when moving from {{Restarting}} to 
> {{CreatingExecutionGraph}} as part of the rescale operation.
> The cause is that determining the parallelism for triggering the state 
> transition from {{WaitingForResources}} into {{CreatingExecutionGraph}} is 
> done in a single synchronous operation. No TM shutdown event can be processed 
> in between. That leads to the {{determineParallelism}} call never failing.
> With the FLIP-472 approach, we call determineParallelism twice independently 
> from each other:
> * When coming up with the rescale decision
> * When creating the ExecutionGraph after the job was cancelled.
> In between the two operations, anything can happen, i.e. also TM shutdown 
> events can be processed. That could lead to the second 
> {{determineParallelism}} call in the {{CreatingExecutionGraph}} state 
> transition to fail (due to resources not being available) which is not 
> properly handled in the 
> {{CreatingExecutionGraph#handleExecutionGraphCreation}}. 
> Right now, the expected behavior is that the {{determineParallelism}} call 
> succeeds and the subsequent slot allocation might fail. If the slot 
> allocation fails, transitioning back to {{WaitingForResources}} is performed.
> This behavior can be resolved in two ways:
> * Handle the {{NoResourceAvailableException}} in the 
> {{CreatingExecutionGraph}} state
> * Pass the available VertexParallelism that lead to the rescale decision to 
> the {{Restarting}} state and check when the job is cancelled whether that 
> parallelism changed. If it didn't change, we could transition to the 
> {{CreatingExecutionGraph}}. If it did change in the mean time, we should 
> transition to {{WaitingForResources}} and try waiting for the resources in 
> another round.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (FLINK-37232) FLIP-272 breaks some synchronization assumption on the AdaptiveScheduler's side

Reply via email to