Matthias Pohl created FLINK-37232:
-------------------------------------
Summary: FLIP-272 breaks some synchronization assumption on the
AdaptiveScheduler's side
Key: FLINK-37232
URL: https://issues.apache.org/jira/browse/FLINK-37232
Project: Flink
Issue Type: Bug
Components: Runtime / Coordination
Affects Versions: 2.0-preview, 2.0.0
Reporter: Matthias Pohl
We noticed some unexpected behavior with the AdaptiveScheduler causing a job to
reach FAILED state due to {{NoResourceAvailableException}}. The cause was that
some TaskManager shut down while the job was performing a rescaling operation.
[~chesnay] did a bit of digging and identified an issue with the state
transition short cut that was introduced in
[FLIP-472|https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states]
(ignoring {{WaitingForResources}} when moving from {{Restarting}} to
{{CreatingExecutionGraph}} as part of the rescale operation.
The cause is that determining the parallelism for triggering the state
transition from {{WaitingForResources}} into {{CreatingExecutionGraph}} is done
in a single synchronous operation. No TM shutdown event can be processed in
between. That leads to the {{determineParallelism}} call never failing.
With the FLIP-472 approach, we call determineParallelism twice independently
from each other:
* When coming up with the rescale decision
* When creating the ExecutionGraph after the job was cancelled.
In between the two operations, anything can happen, i.e. also TM shutdown
events can be processed. That could lead to the second {{determineParallelism}}
call in the {{CreatingExecutionGraph}} state transition to fail (due to
resources not being available) which is not properly handled in the
{{CreatingExecutionGraph#handleExecutionGraphCreation}}.
Right now, the expected behavior is that the {{determineParallelism}} call
succeeds and the subsequent slot allocation might fail. If the slot allocation
fails, transitioning back to {{WaitingForResources}} is performed.
This behavior can be resolved in two ways:
* Handle the {{NoResourceAvailableException}} in the {{CreatingExecutionGraph}}
state
* Pass the available VertexParallelism that lead to the rescale decision to the
{{Restarting}} state and check when the job is cancelled whether that
parallelism changed. If it didn't change, we could transition to the
{{CreatingExecutionGraph}}. If it did change in the mean time, we should
transition to {{WaitingForResources}} and try waiting for the resources in
another round.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)