[
https://issues.apache.org/jira/browse/FLINK-36753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17911757#comment-17911757
]
Rui Fan commented on FLINK-36753:
---------------------------------
Thanks [~samrat007] for the summary! :)
{quote}1. Would extending active checkpoint triggering to downscaling also be
appropriate?
{quote}
As you mentioned, actively triggering a checkpoint could release resources
earlier, so I prefer to trigger a checkpoint actively for both downscaling and
upscaling.
{quote}2. Should we respect the execution.checkpointing.min-pause
configuration when actively triggering a checkpoint for rescaling?
{quote}
I prefer to respect execution.checkpointing.min-pause.
It's not enabled by default, and from what I have seen, most Flink users and
Flink jobs don't enable it. So once it is enabled, the user probably has some
business-logic reason for it, and ignoring it may break the user's
expectations.
There may be several reasons why users do not want to checkpoint frequently,
for example:
* Users want to spend more time processing data instead of taking snapshots.
* For Flink jobs that write to HDFS, files are flushed during each checkpoint,
so frequent checkpoints generate a large number of small files.
* For jobs with 2PC enabled (Kafka producer transactions), frequent checkpoints
commit lots of transactions.
{quote}3. How to handle the case that there is a checkpoint already in process
and execution.checkpointing.max-concurrent-checkpoints allows further
concurrent checkpoints?
{quote}
The motivation for this proposal is that a rescale is executed after a
checkpoint completes, but the next checkpoint may not be triggered until a
long time later.
So if a checkpoint is already in progress, the Adaptive Scheduler doesn't need
to trigger a new one, regardless of the job's max-concurrent-checkpoints
setting.
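To make the three answers above concrete, here is a minimal sketch of the decision I have in mind. It is not Flink code: the class RescaleCheckpointDecision, the Decision enum and the decide method are all hypothetical names for illustration, and I am assuming min-pause is measured from the completion of the previous checkpoint.

```java
import java.time.Duration;

// Hypothetical sketch only: none of these names exist in Flink.
final class RescaleCheckpointDecision {

    enum Decision {
        TRIGGER_NOW,          // no checkpoint running and min-pause has elapsed
        WAIT_FOR_IN_PROGRESS, // rescale after the checkpoint that is already running
        WAIT_FOR_MIN_PAUSE    // respect execution.checkpointing.min-pause
    }

    /**
     * Decides what to do once the desired resources for a rescale
     * (upscaling or downscaling) are ready.
     *
     * @param checkpointInProgress whether any checkpoint is currently running
     * @param sinceLastCompleted   time elapsed since the last checkpoint completed (assumption)
     * @param minPause             configured min-pause; Duration.ZERO means "not enabled"
     */
    static Decision decide(
            boolean checkpointInProgress,
            Duration sinceLastCompleted,
            Duration minPause) {
        // Answer 3: an in-progress checkpoint is enough, independent of
        // max-concurrent-checkpoints.
        if (checkpointInProgress) {
            return Decision.WAIT_FOR_IN_PROGRESS;
        }
        // Answer 2: do not trigger earlier than the user-configured min-pause.
        if (sinceLastCompleted.compareTo(minPause) < 0) {
            return Decision.WAIT_FOR_MIN_PAUSE;
        }
        // Answer 1: otherwise trigger actively, for both scale directions.
        return Decision.TRIGGER_NOW;
    }

    public static void main(String[] args) {
        Duration minPause = Duration.ofSeconds(30);
        System.out.println(decide(true, Duration.ofMinutes(5), minPause));
        System.out.println(decide(false, Duration.ofSeconds(10), minPause));
        System.out.println(decide(false, Duration.ofMinutes(5), minPause));
    }
}
```

With min-pause left at zero, the second branch never applies, so the scheduler would trigger immediately whenever no checkpoint is running.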
These are just my thoughts; I'm open to other opinions on these questions.
> Adaptive Scheduler actively triggers a Checkpoint
> -------------------------------------------------
>
> Key: FLINK-36753
> URL: https://issues.apache.org/jira/browse/FLINK-36753
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 2.0-preview
> Reporter: Rui Fan
> Assignee: Samrat Deb
> Priority: Major
>
> FLIP-461[1] and FLINK-35549[2] allow a rescale to be executed after the
> next completed checkpoint. This greatly reduces the amount of data replayed
> after the rescale.
> In FLIP-461, the Adaptive Scheduler waits for the next periodic checkpoint to
> be triggered. In most scenarios, a more efficient solution might be for the
> Adaptive Scheduler to actively trigger a checkpoint after all resources are
> ready (technically, once the desired resources are ready).
> The idea comes from an offline discussion between [~mxm] and [~fanrui].
> [1][https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]
> [2] https://issues.apache.org/jira/browse/FLINK-35549