[ 
https://issues.apache.org/jira/browse/FLINK-35624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17863717#comment-17863717
 ] 

Rui Fan commented on FLINK-35624:
---------------------------------

In brief, I found one unexpected behavior: none of the merged files under 
taskowned and shared are cleaned up when 
execution.checkpointing.file-merging.enabled is true and 
execution.checkpointing.externalized-checkpoint-retention is 
DELETE_ON_CANCELLATION.
h2. My test job:

[https://github.com/1996fanrui/fanrui-learning/blob/ac0e15e511fb88faf3dba9a0f1c50c37bec52d23/module-flink/src/main/java/com/dream/flink/uc/UnalignedCheckpointAndKeyedStateDemo.java]

The code includes all options. I set 
execution.checkpointing.file-merging.enabled=true and left the other 
file-merging options unset.
h2. Flink version:

I built Flink 1.20 from [https://github.com/apache/flink/pull/25031]
h2. Test procedure:

Start a Flink standalone cluster as described in: 
[https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/overview/#starting-a-standalone-cluster-session-mode]
h3. Job1 (Enable file-merging, and set DELETE_ON_CANCELLATION):
 * execution.checkpointing.externalized-checkpoint-retention=DELETE_ON_CANCELLATION
 ** We expect all checkpoint files to be cleaned up after the job is canceled.
 * execution.checkpointing.file-merging.enabled is true

After the job is canceled, none of the merged files under taskowned and shared 
are cleaned up. As the following picture shows, the chk-x folder is removed, 
but all merged files are retained.

!image-2024-07-08-17-05-40-546.png!
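
For anyone who wants to repeat the Job1 check without comparing screenshots, here is a minimal sketch that lists the files still present under shared and taskowned after the job is canceled. It is not part of the test job above. It assumes the checkpoints were written to a local file system and that the argument is the job's checkpoint directory (<checkpoint-dir>/<job-id>). The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical helper: lists files left under shared/ and taskowned/. */
public class LeftoverCheckpointFiles {

    /**
     * Returns every regular file found below <jobCheckpointDir>/shared and
     * <jobCheckpointDir>/taskowned. chk-x folders are deliberately ignored.
     */
    public static List<Path> findLeftovers(Path jobCheckpointDir) throws IOException {
        List<Path> leftovers = new ArrayList<>();
        for (String sub : new String[] {"shared", "taskowned"}) {
            Path dir = jobCheckpointDir.resolve(sub);
            if (!Files.isDirectory(dir)) {
                continue; // directory already removed, nothing left over here
            }
            try (Stream<Path> walk = Files.walk(dir)) {
                walk.filter(Files::isRegularFile).sorted().forEach(leftovers::add);
            }
        }
        return leftovers;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: LeftoverCheckpointFiles <checkpoint-dir>/<job-id>");
            return;
        }
        List<Path> leftovers = findLeftovers(Paths.get(args[0]));
        leftovers.forEach(System.out::println);
        System.out.println(leftovers.size() + " file(s) left under shared/ and taskowned/");
    }
}
```

With DELETE_ON_CANCELLATION the expectation is that this reports 0 files after cancellation. In the Job1 run described above it would instead report the retained merged files.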
h3. Job2 (Enable file-merging, and observe the checkpoint):
 * execution.checkpointing.externalized-checkpoint-retention=RETAIN_ON_CANCELLATION
 * execution.checkpointing.file-merging.enabled is true

The checkpoint works well.
h3. Job3 (Enable file-merging, and restore from file-merging checkpoint):
 * execution.checkpointing.externalized-checkpoint-retention=RETAIN_ON_CANCELLATION
 * execution.checkpointing.file-merging.enabled is true

Job3 restores from Job2's checkpoint and works well.

h3. Job4 (Disable file-merging, but restore from file-merging checkpoint):
 * execution.checkpointing.externalized-checkpoint-retention=RETAIN_ON_CANCELLATION
 * execution.checkpointing.file-merging.enabled is false

Job4 restores from Job2's checkpoint and works well.

> Release Testing: Verify FLIP-306 Unified File Merging Mechanism for 
> Checkpoints
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-35624
>                 URL: https://issues.apache.org/jira/browse/FLINK-35624
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Zakelly Lan
>            Assignee: Rui Fan
>            Priority: Blocker
>              Labels: release-testing
>             Fix For: 1.20.0
>
>         Attachments: image-2024-07-07-14-04-47-065.png, 
> image-2024-07-08-17-05-40-546.png
>
>
> Follow up the test for https://issues.apache.org/jira/browse/FLINK-32070
>  
> 1.20 is the MVP version for FLIP-306. It is a little bit complex and should 
> be tested carefully. The main idea of FLIP-306 is to merge checkpoint files 
> on the TM side, and provide new {{StateHandle}}s to the JM. There will be a 
> TM-managed directory under the 'shared' checkpoint directory for each 
> subtask, and a TM-managed directory under the 'taskowned' checkpoint 
> directory for each Task Manager. Under those newly introduced directories, 
> the checkpoint files will be merged into a smaller set of files. The 
> following scenarios need to be tested, including but not limited to:
>  # With file merging enabled, periodic checkpoints perform properly, and 
> failover, restore and rescale also work well.
>  # When file merging is switched on and off across jobs, checkpoints and 
> recovery also work properly.
>  # There will be no left-over TM-managed directory, especially when no 
> checkpoint completes before the job is canceled.
>  # File merging has no effect on (native) savepoints.
> Besides the behaviors above, it is better to validate the function of space 
> amplification control and metrics. All the config options can be found under 
> 'execution.checkpointing.file-merging'.



