[ 
https://issues.apache.org/jira/browse/FLINK-32881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renxiang Zhou updated FLINK-32881:
----------------------------------
    Description: 
When triggering a savepoint using the command-line tool, the client needs to 
wait for the job to finish creating the savepoint before it can exit. For jobs 
with large state, the savepoint creation process can be time-consuming, leading 
to the following problems:
 # Platform users may need to manage thousands of Flink tasks on a single 
client machine. With the current savepoint triggering mode, all savepoint 
creation threads on that machine have to wait for the job to finish the 
snapshot, resulting in significant resource waste;
 # If savepoint creation takes longer than the client's timeout, the client 
throws a timeout exception and reports that triggering the savepoint failed. 
Since savepoint durations vary from job to job, it is difficult to choose a 
suitable timeout value on the client side.

Therefore, we propose adding a detach mode for triggering savepoints from the 
client, similar to the detach mode used when submitting jobs. The specific 
details are:
 # The savepoint UUID will be generated on the client side. After successfully 
triggering the savepoint, the client immediately returns the UUID information 
and exits.
 # Add a "dump-pending-savepoints" API that allows the client to check whether 
the triggered savepoint has been successfully created.

With these changes, the client can detach from the savepoint creation 
process, which reduces resource waste, while still having a way to check the 
status of savepoint creation.
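
The proposed flow can be illustrated with a minimal, self-contained Java 
simulation. This is only a sketch under stated assumptions, not the actual 
implementation: none of the types below are Flink classes, and the names 
FakeCluster, triggerDetached, queryStatus and the Status values are 
hypothetical. It shows the client generating the savepoint UUID itself, 
returning right after the trigger call, and checking the result later through 
a separate status query that stands in for "dump-pending-savepoints".

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Self-contained simulation; none of these types are Flink classes. */
class DetachedSavepointSketch {

    /** Hypothetical savepoint states used only in this sketch. */
    enum Status { PENDING, COMPLETED, UNKNOWN }

    /** Stand-in for the cluster side: it only tracks savepoints by ID. */
    static class FakeCluster {
        private final Map<UUID, Status> savepoints = new ConcurrentHashMap<>();

        /** Registers the request and returns at once, without waiting for the snapshot. */
        void triggerSavepoint(UUID savepointId) {
            savepoints.put(savepointId, Status.PENDING);
        }

        /** Simulates the job finishing the snapshot some time later. */
        void completeSavepoint(UUID savepointId) {
            savepoints.replace(savepointId, Status.PENDING, Status.COMPLETED);
        }

        /** Plays the role of the proposed "dump-pending-savepoints" query. */
        Status queryStatus(UUID savepointId) {
            return savepoints.getOrDefault(savepointId, Status.UNKNOWN);
        }
    }

    /** Client side: generate the UUID locally, trigger, and return immediately. */
    static UUID triggerDetached(FakeCluster cluster) {
        UUID savepointId = UUID.randomUUID();
        cluster.triggerSavepoint(savepointId);
        return savepointId;
    }

    public static void main(String[] args) {
        FakeCluster cluster = new FakeCluster();
        UUID id = triggerDetached(cluster);
        System.out.println("triggered " + id + ", status=" + cluster.queryStatus(id));

        // The client could exit here; a later invocation checks the status by UUID.
        cluster.completeSavepoint(id);
        System.out.println("later check, status=" + cluster.queryStatus(id));
    }
}
```

Because the client already holds the UUID when the trigger call returns, no 
client thread has to stay blocked while the snapshot is written, and the 
client-side timeout only needs to cover the trigger request itself.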

!image-2023-08-16-17-14-34-740.png|width=2129,height=625!!image-2023-08-16-17-14-44-212.png|width=1530,height=445!

  was:
When triggering a savepoint using the command-line tool, the client needs to 
wait for the job to finish creating the savepoint before it can exit. For jobs 
with large state, the savepoint creation process can be time-consuming, leading 
to the following problems:
 # Platform users may need to manage thousands of Flink tasks on a single 
client machine. With the current savepoint triggering mode, all savepoint 
creation threads on that machine have to wait for the job to finish the 
snapshot, resulting in significant resource waste;
 # If the savepoint producing time exceeds the client's timeout duration, the 
client will throw a timeout exception and report that the trggering savepoint 
process fails. Since different jobs have varying savepoint durations, it is 
difficult to adjust the client's timeout parameter.

Therefore, we propose adding a detach mode to trigger savepoints on the client 
side, just similar to the detach mode behavior when submitting jobs. Here are 
some specific details:
 # The savepoint UUID will be generated on the client side. After successfully 
triggering the savepoint, the client immediately returns the UUID information.
 # Add a "dump-pending-savepoints" API that allows the client to check whether 
the triggered savepoint has been successfully created.

By implementing these changes, the client can detach from the savepoint 
creation process, reducing resource waste, and providing a way to check the 
status of savepoint creation.

!image-2023-08-16-17-14-34-740.png|width=2129,height=625!!image-2023-08-16-17-14-44-212.png|width=1530,height=445!


> Client supports making savepoints in detach mode
> ------------------------------------------------
>
>                 Key: FLINK-32881
>                 URL: https://issues.apache.org/jira/browse/FLINK-32881
>             Project: Flink
>          Issue Type: Improvement
>          Components: API / State Processor, Client / Job Submission
>    Affects Versions: 1.19.0
>            Reporter: Renxiang Zhou
>            Priority: Major
>              Labels: detach-savepoint
>             Fix For: 1.19.0
>
>         Attachments: image-2023-08-16-17-14-34-740.png, 
> image-2023-08-16-17-14-44-212.png
>
>
> When triggering a savepoint using the command-line tool, the client needs to 
> wait for the job to finish creating the savepoint before it can exit. For 
> jobs with large state, the savepoint creation process can be time-consuming, 
> leading to the following problems:
>  # Platform users may need to manage thousands of Flink tasks on a single 
> client machine. With the current savepoint triggering mode, all savepoint 
> creation threads on that machine have to wait for the job to finish the 
> snapshot, resulting in significant resource waste;
>  # If savepoint creation takes longer than the client's timeout, the client 
> throws a timeout exception and reports that triggering the savepoint 
> failed. Since savepoint durations vary from job to job, it is difficult to 
> choose a suitable timeout value on the client side.
>
> Therefore, we propose adding a detach mode for triggering savepoints from 
> the client, similar to the detach mode used when submitting jobs. The 
> specific details are:
>  # The savepoint UUID will be generated on the client side. After 
> successfully triggering the savepoint, the client immediately returns the 
> UUID information and exits.
>  # Add a "dump-pending-savepoints" API that allows the client to check 
> whether the triggered savepoint has been successfully created.
>
> With these changes, the client can detach from the savepoint creation 
> process, which reduces resource waste, while still having a way to check 
> the status of savepoint creation.
>
> !image-2023-08-16-17-14-34-740.png|width=2129,height=625!!image-2023-08-16-17-14-44-212.png|width=1530,height=445!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
