[
https://issues.apache.org/jira/browse/FLINK-36753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17911326#comment-17911326
]
Samrat Deb commented on FLINK-36753:
------------------------------------
I looked into the requirements and feasibility of this improvement and had a
one-on-one offline discussion with [~fanrui]. In brief, here are the main
open questions and considerations for actively triggering checkpoints during
rescaling:
Open Questions:
1. Would it also be appropriate to extend active checkpoint triggering to
downscaling? Downscaling does not need to wait for additional resources, but
actively triggering a checkpoint would release resources sooner.
2. Should we respect the `execution.checkpointing.min-pause` configuration when
actively triggering a checkpoint for rescaling?
My perspective:
`execution.checkpointing.min-pause` was introduced so that Flink jobs spend
their time processing data rather than being dominated by fault-tolerance
work. Active checkpoint triggering targets scenarios where resources are
ready and a completed checkpoint allows the job to rescale to a higher
parallelism, after which it should process data more efficiently.
Active triggering during downscaling would release resources much earlier and
lead to more efficient resource usage by Flink. Ignoring
`execution.checkpointing.min-pause` brings a clear benefit in these cases.
On the other hand, strictly honoring `execution.checkpointing.min-pause`
respects the user-defined configuration, but may delay rescaling in
situations where active triggering would be beneficial.
Should the performance gain in these specific scenarios take precedence over
adherence to the user-specified checkpointing configuration?
3. If a checkpoint is already in progress and
`execution.checkpointing.max-concurrent-checkpoints` allows further concurrent
checkpoints, should we use the remaining capacity to actively trigger a new
checkpoint and speed up rescaling? Otherwise, the in-progress checkpoint might
simply be assumed to cover the rescale, and no further action would be taken.
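To make questions 2 and 3 concrete, here is a minimal sketch of the decision
being discussed. It is not Flink code: the class, enum, and parameter names
are hypothetical, and `respectMinPause` stands in for whichever answer we
settle on for question 2. The sketch also assumes that, when no concurrency
capacity is left, the scheduler simply waits for the in-progress checkpoint.

```java
import java.time.Duration;

/** Hypothetical illustration only; none of these names exist in Flink. */
final class ActiveTriggerSketch {

    enum Decision { TRIGGER_NOW, WAIT_FOR_MIN_PAUSE, WAIT_FOR_IN_FLIGHT }

    /**
     * Decides whether a rescale may actively trigger a checkpoint.
     *
     * @param inFlightCheckpoints checkpoints currently in progress
     * @param maxConcurrentCheckpoints value of max-concurrent-checkpoints
     * @param sinceLastCompleted time since the last checkpoint completed
     * @param minPause value of min-pause
     * @param respectMinPause assumed switch: honor min-pause or ignore it
     */
    static Decision decide(
            int inFlightCheckpoints,
            int maxConcurrentCheckpoints,
            Duration sinceLastCompleted,
            Duration minPause,
            boolean respectMinPause) {
        // Question 3: with no spare concurrency, rely on the in-progress checkpoint.
        if (inFlightCheckpoints >= maxConcurrentCheckpoints) {
            return Decision.WAIT_FOR_IN_FLIGHT;
        }
        // Question 2: optionally hold back until min-pause has elapsed.
        if (respectMinPause && sinceLastCompleted.compareTo(minPause) < 0) {
            return Decision.WAIT_FOR_MIN_PAUSE;
        }
        return Decision.TRIGGER_NOW;
    }
}
```

With `respectMinPause = false`, a rescale that becomes possible 2 seconds
after the last checkpoint with a 5 second min-pause triggers immediately; with
`respectMinPause = true`, it waits. That difference is exactly the trade-off
raised in question 2.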
[~fanrui] [~mxm] Thoughts?
> Adaptive Scheduler actively triggers a Checkpoint
> -------------------------------------------------
>
> Key: FLINK-36753
> URL: https://issues.apache.org/jira/browse/FLINK-36753
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 2.0-preview
> Reporter: Rui Fan
> Assignee: Samrat Deb
> Priority: Major
>
> FLIP-461[1] and FLINK-35549[2] allow a rescale to be executed after
> the next completed checkpoint. This greatly reduces the amount of data
> replayed after a rescale.
> In FLIP-461, the Adaptive Scheduler waits for the next periodic checkpoint to
> be triggered. In most scenarios, it might be more efficient for the Adaptive
> Scheduler to actively trigger a checkpoint once all resources are ready
> (technically, once the desired resources are ready).
> The idea comes from an offline discussion between [~mxm] and [~fanrui].
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler
> [2] https://issues.apache.org/jira/browse/FLINK-35549
--
This message was sent by Atlassian Jira
(v8.20.10#820010)