[ https://issues.apache.org/jira/browse/FLINK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hong Liang Teoh updated FLINK-32469:
------------------------------------
    Description:
*Why*

We want to enable programmatic use of the checkpoints REST API, independent of the Flink dashboard.

Currently, REST APIs that retrieve information relating to a given Flink job pass through the {{ExecutionGraphCache}}. This means that all these APIs can return data that is stale by up to the {{web.refresh-interval}}, which defaults to 3s. For programmatic use of the REST API, we should be able to retrieve either the latest or the cached version depending on the client (the Flink dashboard gets the cached version, other clients get the latest version).

For example, a user might want to use the REST API to retrieve the latest completed checkpoint for a given Flink job. This might be useful when using existing checkpoints as a state store while migrating a Flink job from one cluster to another. See the example use case below.

*What*

This change moves the checkpoints REST APIs onto their own cache, separate from the {{ExecutionGraphCache}}. This way, a user can set the timeout for the checkpoints cache to 0s (disabling the cache) without much effect on the user experience of the Flink dashboard.

In addition, the checkpoint handlers currently retrieve the {{ExecutionGraph}} first, then retrieve the {{CheckpointStatsSnapshot}} from the graph. This is not needed, since the checkpoint handlers only need the {{CheckpointStatsSnapshot}}. With this change, these handlers retrieve only the minimal required information ({{CheckpointStatsSnapshot}}) to construct a reply.

*Example use case*

When performing security patching / maintenance of the infrastructure supporting the Flink cluster, we might want to transfer a given Flink job to another cluster, whilst maintaining state.
We can do this via the below steps:
# Old cluster - Select completed checkpoint on existing Flink job
# Old cluster - Stop the existing Flink job
# New cluster - Start a new Flink job with selected checkpoint

Step 1 requires us to query the checkpoints REST API for the latest completed checkpoint. With the status quo, we need to wait 3s (or whatever the {{ExecutionGraphCache}} expiry may be) to be sure the response is current. This is undesirable because the Flink job then has to reprocess the equivalent of 3s of data (or whatever the execution graph cache timeout is).

  was:
*Why*

We want to enable programmatic use of the checkpoints REST API, independent of the Flink dashboard.

Currently, REST APIs that retrieve information relating to a given Flink job pass through the {{ExecutionGraphCache}}. This means that all these APIs can return data that is stale by up to the {{web.refresh-interval}}, which defaults to 3s. For programmatic use of the REST API, we should be able to retrieve either the latest or the cached version depending on the client (the Flink dashboard gets the cached version, other clients get the latest version).

For example, a user might want to use the REST API to retrieve the latest completed checkpoint for a given Flink job. This might be useful when using existing checkpoints as a state store while migrating a Flink job from one cluster to another. See Appendix for example.

*What*

This change moves the checkpoints REST APIs onto their own cache, separate from the {{ExecutionGraphCache}}. This way, a user can set the timeout for the checkpoints cache to 0s (disabling the cache) without much effect on the user experience of the Flink dashboard.

In addition, the checkpoint handlers currently retrieve the {{ExecutionGraph}} first, then retrieve the {{CheckpointStatsSnapshot}} from the graph.
This is not needed, since the checkpoint handlers only need the {{CheckpointStatsSnapshot}}. With this change, these handlers retrieve only the minimal required information ({{CheckpointStatsSnapshot}}) to construct a reply.


> Improve checkpoint REST APIs for programmatic access
> ----------------------------------------------------
>
>                 Key: FLINK-32469
>                 URL: https://issues.apache.org/jira/browse/FLINK-32469
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / REST
>    Affects Versions: 1.16.2, 1.17.1
>            Reporter: Hong Liang Teoh
>            Assignee: Hong Liang Teoh
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.18.0
>
>
> *Why*
> We want to enable programmatic use of the checkpoints REST API, independent
> of the Flink dashboard.
> Currently, REST APIs that retrieve information relating to a given Flink job
> pass through the {{ExecutionGraphCache}}. This means that all these APIs can
> return data that is stale by up to the {{web.refresh-interval}}, which
> defaults to 3s. For programmatic use of the REST API, we should be able to
> retrieve either the latest or the cached version depending on the client
> (the Flink dashboard gets the cached version, other clients get the latest
> version).
> For example, a user might want to use the REST API to retrieve the latest
> completed checkpoint for a given Flink job. This might be useful when using
> existing checkpoints as a state store while migrating a Flink job from one
> cluster to another. See the example use case below.
> *What*
> This change moves the checkpoints REST APIs onto their own cache, separate
> from the {{ExecutionGraphCache}}. This way, a user can set the timeout for
> the checkpoints cache to 0s (disabling the cache) without much effect on the
> user experience of the Flink dashboard.
> In addition, the checkpoint handlers currently retrieve the
> {{ExecutionGraph}} first, then retrieve the {{CheckpointStatsSnapshot}} from
> the graph.
> This is not needed, since the checkpoint handlers only need the
> {{CheckpointStatsSnapshot}}. With this change, these handlers retrieve only
> the minimal required information ({{CheckpointStatsSnapshot}}) to construct
> a reply.
>
> *Example use case*
> When performing security patching / maintenance of the infrastructure
> supporting the Flink cluster, we might want to transfer a given Flink job to
> another cluster, whilst maintaining state. We can do this via the below steps:
> # Old cluster - Select completed checkpoint on existing Flink job
> # Old cluster - Stop the existing Flink job
> # New cluster - Start a new Flink job with selected checkpoint
> Step 1 requires us to query the checkpoints REST API for the latest completed
> checkpoint. With the status quo, we need to wait 3s (or whatever the
> {{ExecutionGraphCache}} expiry may be) to be sure the response is current.
> This is undesirable because the Flink job then has to reprocess the
> equivalent of 3s of data (or whatever the execution graph cache timeout is).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
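
Step 1 of the example use case (selecting the latest completed checkpoint through the REST API) could be sketched roughly as follows. This is an illustrative sketch only, not part of the ticket: it assumes the checkpoints statistics endpoint is {{/jobs/<jobid>/checkpoints}} and that its JSON reply exposes the checkpoint location under {{latest.completed.external_path}}. The helper names, the sample payload and the path in it are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def latest_completed_checkpoint_path(stats: dict) -> Optional[str]:
    """Return the external path of the latest completed checkpoint, or None.

    Assumes the reply shape {"latest": {"completed": {"external_path": ...}}}.
    """
    latest = stats.get("latest") or {}
    completed = latest.get("completed")
    if not completed:
        # No checkpoint has completed yet (or the field is absent).
        return None
    return completed.get("external_path")


def fetch_checkpoint_stats(base_url: str, job_id: str) -> dict:
    """GET the checkpoint statistics for one job from the JobManager REST endpoint."""
    url = f"{base_url.rstrip('/')}/jobs/{job_id}/checkpoints"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Trimmed, made-up reply used only to exercise the parsing logic offline.
SAMPLE = json.loads("""
{
  "counts": {"completed": 42, "failed": 0, "in_progress": 0},
  "latest": {
    "completed": {
      "id": 42,
      "status": "COMPLETED",
      "external_path": "hdfs:///flink/checkpoints/abc/chk-42"
    },
    "failed": null
  }
}
""")

if __name__ == "__main__":
    print(latest_completed_checkpoint_path(SAMPLE))
```

Against a running cluster, one would call {{fetch_checkpoint_stats}} with the JobManager address and job ID and pass the result to {{latest_completed_checkpoint_path}}. As long as the reply is served from a cache, the returned path can lag the true latest checkpoint by up to the cache timeout, which is the staleness this ticket aims to remove.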