[
https://issues.apache.org/jira/browse/FLINK-37232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias Pohl reassigned FLINK-37232:
-------------------------------------
Assignee: Matthias Pohl
> FLIP-472 breaks some synchronization assumption on the AdaptiveScheduler's
> side
> -------------------------------------------------------------------------------
>
> Key: FLINK-37232
> URL: https://issues.apache.org/jira/browse/FLINK-37232
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 2.0.0, 2.0-preview
> Reporter: Matthias Pohl
> Assignee: Matthias Pohl
> Priority: Blocker
>
> We noticed some unexpected behavior with the AdaptiveScheduler causing a job
> to reach FAILED state due to {{NoResourceAvailableException}}. The cause was
> that some TaskManager shut down while the job was performing a rescaling
> operation.
> [~chesnay] did a bit of digging and identified an issue with the state
> transition shortcut that was introduced in
> [FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states]
> (ignoring {{WaitingForResources}} when moving from {{Restarting}} to
> {{CreatingExecutionGraph}} as part of the rescale operation).
> The cause is that, without the shortcut, determining the parallelism and
> triggering the state transition from {{WaitingForResources}} into
> {{CreatingExecutionGraph}} happen in a single synchronous operation. No TM
> shutdown event can be processed in between, so the {{determineParallelism}}
> call never fails.
> With the FLIP-472 approach, we call {{determineParallelism}} twice,
> independently of each other:
> * When coming up with the rescale decision
> * When creating the ExecutionGraph after the job was cancelled.
> In between the two operations, anything can happen, e.g. TM shutdown events
> can be processed. That could cause the second {{determineParallelism}} call
> in the {{CreatingExecutionGraph}} state transition to fail (due to resources
> not being available), which is not properly handled in
> {{CreatingExecutionGraph#handleExecutionGraphCreation}}.
> Right now, the expected behavior is that the {{determineParallelism}} call
> succeeds and the subsequent slot allocation might fail. If the slot
> allocation fails, the scheduler transitions back to {{WaitingForResources}}.
> This issue can be resolved in two ways:
> * Handle the {{NoResourceAvailableException}} in the
> {{CreatingExecutionGraph}} state
> * Pass the available VertexParallelism that led to the rescale decision to
> the {{Restarting}} state and check when the job is cancelled whether that
> parallelism changed. If it didn't change, we could transition to
> {{CreatingExecutionGraph}}. If it did change in the meantime, we should
> transition to {{WaitingForResources}} and wait for the resources in
> another round.
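
The second option above can be sketched roughly as follows. This is a simplified, hypothetical model and not Flink's actual code: the class RescaleRecheckSketch, its SlotPool stand-in, the NextState enum, and the afterJobCancelled method are invented for illustration, and the real VertexParallelism is reduced to a single int. The sketch only shows the idea of re-checking the parallelism once the job is cancelled and falling back to WaitingForResources if it changed or can no longer be determined.

```java
import java.util.OptionalInt;

/** Hypothetical, simplified model of the re-check; none of these types are Flink classes. */
final class RescaleRecheckSketch {

    /** Follow-up states of a restart; names mirror the states mentioned in the issue. */
    enum NextState { CREATING_EXECUTION_GRAPH, WAITING_FOR_RESOURCES }

    /** Stand-in for the slot pool; the free slot count shrinks when a TaskManager is lost. */
    static final class SlotPool {
        private int freeSlots;

        SlotPool(int freeSlots) {
            this.freeSlots = freeSlots;
        }

        void taskManagerLost(int slots) {
            freeSlots = Math.max(0, freeSlots - slots);
        }

        /** Returns empty if not even the minimum parallelism fits into the free slots. */
        OptionalInt determineParallelism(int min, int max) {
            if (freeSlots < min) {
                return OptionalInt.empty();
            }
            return OptionalInt.of(Math.min(freeSlots, max));
        }
    }

    /**
     * Called once the job is cancelled. Proceeds to graph creation only if the
     * parallelism still matches the one that led to the rescale decision.
     */
    static NextState afterJobCancelled(SlotPool pool, int decidedParallelism, int min, int max) {
        OptionalInt current = pool.determineParallelism(min, max);
        boolean unchanged = current.isPresent() && current.getAsInt() == decidedParallelism;
        return unchanged ? NextState.CREATING_EXECUTION_GRAPH : NextState.WAITING_FOR_RESOURCES;
    }

    public static void main(String[] args) {
        SlotPool pool = new SlotPool(4);
        // First call: the parallelism that triggers the rescale decision.
        int decided = pool.determineParallelism(2, 4).getAsInt();
        // A TaskManager with three slots shuts down while the job is being cancelled.
        pool.taskManagerLost(3);
        // Second call happens inside afterJobCancelled and no longer matches.
        System.out.println(afterJobCancelled(pool, decided, 2, 4));
    }
}
```

With the TaskManager loss between the two calls, the sketch returns WAITING_FOR_RESOURCES instead of failing; without the loss it returns CREATING_EXECUTION_GRAPH. The first option (handling the NoResourceAvailableException in CreatingExecutionGraph) would place the same fallback at the point where the second call fails.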