[jira] [Commented] (FLINK-37375) Checkpoint supports the Operator to customize asynchronous snapshot state

Zakelly Lan (Jira) Fri, 14 Mar 2025 00:42:45 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-37375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17935434#comment-17935434
 ]


Zakelly Lan commented on FLINK-37375:
-------------------------------------

[~hejufang001] +1 for the motivation and direction. I'm still reading the doc 
and have some questions:

Will (or should) Flink take the result of the async procedure, i.e. is that 
result required for recovery? Take the state backend's checkpoint as an 
example: when the async phase finishes, the handles are reported to the JM and 
finalized. During recovery, these handles are sent back to the state backend. 
In some cases the asynchronous procedure has to finish before Flink can obtain 
the metadata (consumed offset, transactional id, or similar) and record it. 
How would the current design achieve this? Alternatively, if Flink does not 
take any result from the asynchronous phase because it is not required for 
recovery, that is also fine. Either way, we need to clarify the use case and 
the whole life cycle of checkpointing and recovery.
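
To make the life cycle in question concrete, here is a minimal, self-contained 
Java sketch. It is not Flink's API and not the design in the doc: the names 
UploadHandle, Coordinator, WriterOperator and the "remote://" path are all 
hypothetical stand-ins. It only illustrates the case where the async phase 
produces a result that must be acknowledged and later handed back on recovery.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncSnapshotSketch {

    /** Hypothetical stand-in for a state handle: metadata needed for recovery. */
    public static final class UploadHandle {
        public final long checkpointId;
        public final String remotePath;

        public UploadHandle(long checkpointId, String remotePath) {
            this.checkpointId = checkpointId;
            this.remotePath = remotePath;
        }
    }

    /** Hypothetical stand-in for the JM side that finalizes reported handles. */
    public static final class Coordinator {
        private final Map<Long, UploadHandle> finalized = new ConcurrentHashMap<>();

        public void acknowledge(UploadHandle handle) {
            finalized.put(handle.checkpointId, handle);
        }

        public UploadHandle handleForRestore(long checkpointId) {
            return finalized.get(checkpointId);
        }
    }

    /** Hypothetical operator with a cheap sync phase and a slow async phase. */
    public static final class WriterOperator {
        private final List<String> buffer = new ArrayList<>();

        public void processElement(String record) {
            buffer.add(record);
        }

        public CompletableFuture<UploadHandle> snapshot(long checkpointId, ExecutorService asyncPool) {
            // Sync phase (main thread): freeze the current data, nothing slow here.
            List<String> frozen = new ArrayList<>(buffer);
            buffer.clear();
            // Async phase: the slow "upload" runs off the main thread; its result
            // is the metadata that would have to be reported for recovery.
            return CompletableFuture.supplyAsync(() -> {
                String path = "remote://chk-" + checkpointId + "/" + frozen.size() + "-records";
                return new UploadHandle(checkpointId, path);
            }, asyncPool);
        }
    }

    /** Runs one checkpoint: snapshot, acknowledge, then look up the handle for restore. */
    public static UploadHandle runOnce() {
        ExecutorService asyncPool = Executors.newSingleThreadExecutor();
        try {
            Coordinator jm = new Coordinator();
            WriterOperator operator = new WriterOperator();
            operator.processElement("a");
            operator.processElement("b");
            // The checkpoint can only be finalized once the async result exists.
            UploadHandle handle = operator.snapshot(1L, asyncPool).join();
            jm.acknowledge(handle);
            // On recovery, the finalized handle is what gets sent back.
            return jm.handleForRestore(1L);
        } finally {
            asyncPool.shutdown();
        }
    }

    public static void main(String[] args) {
        UploadHandle restored = runOnce();
        System.out.println(restored.checkpointId + " -> " + restored.remotePath);
    }
}
```

If the async phase returned nothing that recovery depends on, the acknowledge 
and restore steps in this sketch would simply drop out, which is the other 
option raised above.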

> Checkpoint supports the Operator to customize asynchronous snapshot state
> -------------------------------------------------------------------------
>
>                 Key: FLINK-37375
>                 URL: https://issues.apache.org/jira/browse/FLINK-37375
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.20.1
>            Reporter: Jufang He
>            Priority: Major
>
> In some Flink task operators, slow operations such as file uploads or data 
> flushing may be performed during the synchronous phase of Checkpoint. Due to 
> performance issues with external storage components, the synchronous phase 
> may take too long to execute, significantly impacting the job's throughput. 
> For example, during our internal use of Paimon, we observed that uploading 
> files to HDFS during the Checkpoint synchronous phase could encounter random 
> HDFS slow node issues, leading to a substantial negative impact on task 
> throughput.
> To address this issue, I propose a generic feature that lets operators 
> customize their asynchronous snapshot, so that users can move time-consuming 
> logic to the asynchronous phase of Checkpoint, thereby minimizing the blocking 
> of the main thread and improving task throughput. For instance, the Paimon writer 
> operator could write data locally during the Checkpoint synchronous phase and 
> upload files to remote storage during the asynchronous phase. Beyond the 
> Paimon data upload scenario, other operator logic may also experience slow 
> execution during the synchronous phase. This approach aims to uniformly 
> optimize such issues.
> I drafted a FLIP for this issue: 
> [https://docs.google.com/document/d/1lwxLEQjD6jVhZUBMRGhzQNWKSvdbPbYNQsV265gR4kw]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
