[
https://issues.apache.org/jira/browse/FLINK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739912#comment-17739912
]
Hong Liang Teoh commented on FLINK-32469:
-----------------------------------------
Thanks for the review, [~dmvk]. I have rephrased the wording. Let me know if you
have any other concerns.
> Improve checkpoint REST APIs for programmatic access
> ----------------------------------------------------
>
> Key: FLINK-32469
> URL: https://issues.apache.org/jira/browse/FLINK-32469
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / REST
> Affects Versions: 1.16.2, 1.17.1
> Reporter: Hong Liang Teoh
> Assignee: Hong Liang Teoh
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.18.0
>
>
> *Why*
> We want to enable programmatic use of the checkpoints REST API, independent
> of the Flink dashboard.
> Currently, REST APIs that retrieve information relating to a given Flink job
> pass through the {{ExecutionGraphCache}}. This means that all these
> APIs may return stale data, depending on the {{web.refresh-interval}},
> which defaults to 3s. For programmatic use of the REST API, we should be able
> to retrieve either the latest or the cached version depending on the client
> (the Flink dashboard gets the cached version, other clients get the latest
> version).
> For example, a user might want to use the REST API to retrieve the latest
> completed checkpoint for a given Flink job. This might be useful when trying
> to use existing checkpoints as a state store when migrating a Flink job from
> one cluster to another. See the example use case below.
> *What*
> This change moves the checkpoints REST APIs onto their own cache, separate
> from the {{ExecutionGraphCache}}. This way, a user can set the timeout for the
> checkpoints cache to 0s (disabling the cache) without noticeably affecting
> the user experience on the Flink dashboard.
> In addition, the checkpoint handlers currently first retrieve the
> {{ExecutionGraph}}, then retrieve the {{CheckpointStatsSnapshot}} from
> the graph. This is not needed, since the checkpoint handlers only need the
> {{CheckpointStatsSnapshot}}. With this change, these handlers retrieve only
> the minimal required information ({{CheckpointStatsSnapshot}}) to
> construct a reply.
>
> *Example use case*
> When performing security patching / maintenance of the infrastructure
> supporting the Flink cluster, we might want to transfer a given Flink job to
> another cluster whilst maintaining state. We can do this with the steps below:
> # Old cluster - Select completed checkpoint on existing Flink job
> # Old cluster - Stop the existing Flink job
> # New cluster - Start a new Flink job with selected checkpoint
> Step 1 requires us to query the checkpoints REST API for the latest completed
> checkpoint. With the status quo, we need to wait 3s (or whatever the
> {{ExecutionGraphCache}} expiry is) for the cached data to refresh. This is
> undesirable because it means the Flink job will have to reprocess the
> equivalent of 3s of data (or whatever the execution graph cache timeout is).
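
Step 1 of the use case above could be sketched roughly as follows. This is a minimal illustration, not part of the proposed change: it assumes the checkpoints statistics endpoint is reachable at /jobs/<job_id>/checkpoints and that its JSON reply carries the latest completed checkpoint under latest.completed with an external_path field. The base URL, job ID, and helper names are hypothetical placeholders.

```python
import json
from typing import Optional
from urllib.request import urlopen


def checkpoints_url(base_url: str, job_id: str) -> str:
    """Build the checkpoint statistics URL (endpoint path is an assumption)."""
    return f"{base_url.rstrip('/')}/jobs/{job_id}/checkpoints"


def latest_completed_path(stats: dict) -> Optional[str]:
    """Return the external path of the latest completed checkpoint, if any.

    Assumes a reply shaped like:
    {"latest": {"completed": {"id": ..., "external_path": "..."}}}
    """
    completed = (stats.get("latest") or {}).get("completed")
    if not completed:
        # No checkpoint has completed yet (or the field is absent).
        return None
    return completed.get("external_path")


def fetch_latest_completed_path(
    base_url: str, job_id: str, timeout: float = 10.0
) -> Optional[str]:
    """Query the REST API once and extract the latest completed checkpoint path."""
    with urlopen(checkpoints_url(base_url, job_id), timeout=timeout) as resp:
        return latest_completed_path(json.load(resp))
```

With the checkpoints cache timeout set to 0s, a call such as fetch_latest_completed_path("http://old-cluster:8081", job_id) made just before stopping the job would be expected to reflect the most recent completed checkpoint rather than a cached reply up to 3s old.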
--
This message was sent by Atlassian Jira
(v8.20.10#820010)