Matthias Pohl created FLINK-36512:
-------------------------------------
Summary: Make rescale trigger based on failed checkpoints depend
on the cause
Key: FLINK-36512
URL: https://issues.apache.org/jira/browse/FLINK-36512
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Affects Versions: 2.0.0
Reporter: Matthias Pohl
[FLIP-461|https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]
introduced rescaling on checkpoints. The trigger logic is also initiated for
failed checkpoints (once a counter of failed checkpoints reaches a configurable
limit).
The issue here is that we might end up counting failed checkpoints that we
don't actually want to consider (e.g. checkpoint failures caused by not all
tasks running yet). Instead, we should start considering checkpoints only once
the job has started running, to avoid unnecessary (premature) rescale decisions.
We already have logic like that in place in the
[CheckpointFailureManager|https://github.com/apache/flink/blob/8be94e6663d8ac6e3d74bf4cd5f540cc96c8289e/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java#L217]
which we might want to reuse here as well.
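
To illustrate the idea, here is a minimal, self-contained sketch of a failure counter that depends on the cause. It is not Flink code: the class FailedCheckpointRescaleTrigger, the enum FailureCause and its constants, and the reset on a completed checkpoint are all assumptions made for illustration. The actual change would probably rely on the existing failure reasons rather than a new enum.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch; none of these names are existing Flink APIs. */
final class FailedCheckpointRescaleTrigger {

    /** Illustrative failure causes (assumed for this example only). */
    enum FailureCause {
        NOT_ALL_TASKS_RUNNING,
        CHECKPOINT_DECLINED,
        CHECKPOINT_EXPIRED
    }

    /** Causes that should not count towards a rescale decision. */
    private static final Set<FailureCause> IGNORED_CAUSES =
            EnumSet.of(FailureCause.NOT_ALL_TASKS_RUNNING);

    private final int maxFailedCheckpoints;
    private int failedCheckpointCount;

    FailedCheckpointRescaleTrigger(int maxFailedCheckpoints) {
        if (maxFailedCheckpoints < 1) {
            throw new IllegalArgumentException("The limit must be at least 1.");
        }
        this.maxFailedCheckpoints = maxFailedCheckpoints;
    }

    /**
     * Records a failed checkpoint.
     *
     * @return true if the configured limit was reached and a rescale should be triggered
     */
    boolean onCheckpointFailed(FailureCause cause) {
        if (IGNORED_CAUSES.contains(cause)) {
            // The job is not fully running yet, so this failure is not counted.
            return false;
        }
        failedCheckpointCount++;
        if (failedCheckpointCount >= maxFailedCheckpoints) {
            // Start counting from scratch after the trigger fired.
            failedCheckpointCount = 0;
            return true;
        }
        return false;
    }

    /** Assumption: a completed checkpoint resets the failure counter. */
    void onCheckpointCompleted() {
        failedCheckpointCount = 0;
    }
}
```

With a limit of 2, any number of NOT_ALL_TASKS_RUNNING failures leaves the counter untouched, and only the second counted failure returns true.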