[jira] [Commented] (FLINK-36753) Adaptive Scheduler actively triggers a Checkpoint

Rui Fan (Jira) Thu, 09 Jan 2025 19:39:54 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-36753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17911757#comment-17911757
 ]


Rui Fan commented on FLINK-36753:
---------------------------------

Thanks [~samrat007] for the summary! :)
{quote}1. Would extending active checkpoint triggering to downscaling also be 
appropriate?
{quote}
As you mentioned, actively triggering a checkpoint could release resources 
earlier, so I prefer to trigger a checkpoint actively for both downscaling and 
upscaling.
{quote}2. Should we respect the execution.checkpointing.min-pause 
configuration when actively triggering a checkpoint for rescaling?
{quote}
I prefer to respect the execution.checkpointing.min-pause.

It's not enabled by default, and I have found that most Flink users and jobs 
don't enable it. So once it is enabled, the user probably has some 
business-logic reason for it, and if we don't respect it, we may break the 
user's expectations.

There may be some reasons why users do not want to checkpoint frequently, for 
example:
 * Users want to spend more time processing data instead of taking snapshots.
 * For Flink jobs that write to HDFS, files are flushed during each checkpoint, 
so frequent checkpoints generate a large number of small files.
 * For jobs with 2PC enabled (e.g. Kafka producer transactions), frequent 
checkpoints commit lots of transactions.

{quote}3. How to handle the case that there is a checkpoint already in process 
and execution.checkpointing.max-concurrent-checkpoints allows further 
concurrent checkpoints?
{quote}
The motivation for this proposal is that a rescale is executed after a 
checkpoint completes, but the next periodic checkpoint may not be triggered 
until a long time later.

So if a checkpoint is already in progress, the Adaptive Scheduler doesn't need 
to trigger a new one, regardless of whether max-concurrent-checkpoints would 
allow it.
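
To make the answers to questions 2 and 3 concrete, here is a minimal sketch of 
the decision I have in mind. It is not Flink code: the class, the enum, and the 
parameter names are all hypothetical, and it assumes min-pause is measured from 
the completion of the previous checkpoint.

```java
import java.time.Duration;

/** Hypothetical helper for illustration only; not part of Flink's API. */
final class ActiveCheckpointDecision {

    /** Possible outcomes once the desired resources are ready. */
    enum Action {
        TRIGGER_NOW,        // nothing pending and min-pause has elapsed
        WAIT_FOR_PENDING,   // a checkpoint is already in progress; reuse it
        WAIT_FOR_MIN_PAUSE  // respect execution.checkpointing.min-pause
    }

    /**
     * @param pendingCheckpoints number of checkpoints currently in progress (assumed input)
     * @param sinceLastCompleted time elapsed since the last checkpoint completed (assumed input)
     * @param minPause configured min-pause; Duration.ZERO models "not enabled"
     */
    static Action decide(
            int pendingCheckpoints, Duration sinceLastCompleted, Duration minPause) {
        // Question 3: a pending checkpoint will let the rescale proceed once it
        // completes, so max-concurrent-checkpoints is deliberately not consulted.
        if (pendingCheckpoints > 0) {
            return Action.WAIT_FOR_PENDING;
        }
        // Question 2: never trigger earlier than the user-configured min-pause.
        if (sinceLastCompleted.compareTo(minPause) < 0) {
            return Action.WAIT_FOR_MIN_PAUSE;
        }
        return Action.TRIGGER_NOW;
    }
}
```

With min-pause left at zero, the only thing that delays the active trigger in 
this sketch is a checkpoint that is already in progress.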

These are just my thoughts; I'm open to other opinions on these questions.

> Adaptive Scheduler actively triggers a Checkpoint
> -------------------------------------------------
>
>                 Key: FLINK-36753
>                 URL: https://issues.apache.org/jira/browse/FLINK-36753
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 2.0-preview
>            Reporter: Rui Fan
>            Assignee: Samrat Deb
>            Priority: Major
>
> FLIP-461[1] and FLINK-35549[2] allow a rescale to be executed after the next 
> completed checkpoint, which greatly reduces the amount of data replayed 
> after the rescale.
> In FLIP-461, Adaptive Scheduler waits for the next periodic checkpoint to be 
> triggered. In most scenarios, a more efficient solution might be for Adaptive 
> Scheduler to actively trigger a checkpoint once all resources are ready 
> (technically, once the desired resources are ready).
> The idea comes from an offline discussion between [~mxm]  and [~fanrui].
> [1][https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler]
> [2] https://issues.apache.org/jira/browse/FLINK-35549



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
