[jira] [Updated] (FLINK-32469) Improve checkpoint REST APIs for programmatic access

Hong Liang Teoh (Jira) Tue, 04 Jul 2023 05:41:04 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-32469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hong Liang Teoh updated FLINK-32469:
------------------------------------
    Description: 
*Why*

We want to enable programmatic use of the checkpoints REST API, independent of 
the Flink dashboard.

Currently, REST APIs that retrieve information relating to a given Flink job 
pass through the {{ExecutionGraphCache}}. This means that all these APIs may 
return stale data, depending on the {{web.refresh-interval}}, which defaults 
to 3s. For programmatic use of the REST API, we should be able to retrieve 
either the latest or the cached version, depending on the client (the Flink 
dashboard gets the cached version, other clients get the latest version).

For example, a user might want to use the REST API to retrieve the latest 
completed checkpoint for a given Flink job. This might be useful when using 
existing checkpoints as a state store while migrating a Flink job from one 
cluster to another. See Appendix for example.
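
As a rough illustration of that use case, here is a minimal Python sketch of a client that reads the latest completed checkpoint from the checkpoints endpoint. It assumes the response of /jobs/<jobid>/checkpoints carries a "latest" object with a "completed" entry that includes an "external_path" field; please check those field names against the REST API documentation of the Flink version in use. The base URL, job id and sample payload are hypothetical.

```python
import json
from typing import Optional
from urllib.request import urlopen


def latest_completed_checkpoint(stats: dict) -> Optional[dict]:
    """Return the latest completed checkpoint entry, or None if there is none.

    Assumes the payload shape {"latest": {"completed": {...}}}.
    """
    latest = stats.get("latest") or {}
    return latest.get("completed")


def fetch_checkpoint_stats(base_url: str, job_id: str) -> dict:
    """GET <base_url>/jobs/<job_id>/checkpoints and decode the JSON body."""
    with urlopen(f"{base_url}/jobs/{job_id}/checkpoints", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical payload, so the example runs without a cluster.
    sample = {"latest": {"completed": {"id": 42, "external_path": "s3://bucket/chk-42"}}}
    checkpoint = latest_completed_checkpoint(sample)
    if checkpoint is not None:
        print(checkpoint["id"], checkpoint["external_path"])
```

Against a live cluster, the same helper would be called as latest_completed_checkpoint(fetch_checkpoint_stats(base_url, job_id)). With the current caching behaviour, the result of such a call can lag behind the job's actual state by up to the refresh interval.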

*What*

This change gives the checkpoints REST APIs their own cache, separate from 
the {{ExecutionGraphCache}}. This way, a user can set the timeout for the 
checkpoints cache to 0s (disabling the cache) without much effect on the 
user experience of the Flink dashboard.
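
To illustrate why a separate cache helps, the sketch below models two independent time-based caches in Python. It is not Flink's implementation: the TtlCache class, its parameters and the injectable clock are hypothetical names chosen for this example. It only shows that a timeout of 0 makes every lookup hit the loader, while another cache with a 3s timeout keeps serving cached values.

```python
import time
from typing import Callable, Dict, Generic, Tuple, TypeVar

V = TypeVar("V")


class TtlCache(Generic[V]):
    """Tiny time-based cache; a ttl of 0 disables caching (illustration only)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, V]] = {}

    def get(self, key: str, loader: Callable[[], V]) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        # Serve the cached value only while it is younger than the ttl.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = loader()
        if self._ttl > 0:
            # With ttl == 0 nothing is stored, so every call reloads.
            self._entries[key] = (now, value)
        return value


if __name__ == "__main__":
    dashboard_cache: TtlCache[str] = TtlCache(ttl_seconds=3.0)   # dashboard-style cache
    checkpoint_cache: TtlCache[str] = TtlCache(ttl_seconds=0.0)  # caching disabled
    print(dashboard_cache.get("job-1", lambda: "execution graph"))
    print(checkpoint_cache.get("job-1", lambda: "checkpoint stats"))
```

Because the two caches are independent objects, setting the timeout of one to 0 does not change how long the other keeps its entries.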

In addition, the checkpoint handlers currently first retrieve the 
{{ExecutionGraph}}, then retrieve the {{CheckpointStatsSnapshot}} from the 
graph. This is not needed, since the checkpoint handlers only need the 
{{CheckpointStatsSnapshot}}. With this change, these handlers retrieve only 
the minimal required information ({{CheckpointStatsSnapshot}}) to construct a 
reply.

  was:
*Why*

We want to enable programmatic use of the checkpoints REST API, independent of 
the Flink dashboard.

Currently, REST APIs that retrieve information relating to a given Flink job 
pass through the {{ExecutionGraphCache}}. This means that all these APIs may 
return stale data, depending on the {{web.refresh-interval}}, which defaults 
to 3s.

For programmatic use of the REST API, ideally we should be able to retrieve the 
latest / cached version depending 

The current configuration of the `ExecutionGraph` cache is meant to facilitate 
a fluid user experience of the Flink dashboard. On the Job details page, the 
Flink dashboard makes a series of requests (e.g. /jobs/{jobid}, 
/jobs/{jobid}/vertices/{vertexid}).

To ensure that the requests return consistent results, we have 
the execution graph cache.

*What*

The checkpoint handlers currently retrieve checkpoint information from the 
`ExecutionGraph`, which is cached in the `AbstractExecutionGraphHandler`. This 
means that this information is potentially stale (depending on the 
`web.refresh-interval`, which defaults to 3s).

We want to make the checkpoint handlers directly retrieve the latest 
`CheckpointStatsSnapshot` object instead of relying on the cached 
`ExecutionGraph`.


> Improve checkpoint REST APIs for programmatic access
> ----------------------------------------------------
>
>                 Key: FLINK-32469
>                 URL: https://issues.apache.org/jira/browse/FLINK-32469
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / REST
>    Affects Versions: 1.16.2, 1.17.1
>            Reporter: Hong Liang Teoh
>            Assignee: Hong Liang Teoh
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.18.0



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
