[jira] [Commented] (FLINK-36932) Add resource-level metrics for different status/states to flink-kubernetes-operator

Philippe Gref-Viau (Jira) Mon, 13 Jan 2025 15:25:03 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-36932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912674#comment-17912674
 ]


Philippe Gref-Viau commented on FLINK-36932:
--------------------------------------------

Just bumping this up to see if I can get eyes on the issue and the proposed 
implementation.

> Add resource-level metrics for different status/states to 
> flink-kubernetes-operator
> -----------------------------------------------------------------------------------
>
>                 Key: FLINK-36932
>                 URL: https://issues.apache.org/jira/browse/FLINK-36932
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator, Runtime / Metrics
>            Reporter: Philippe Gref-Viau
>            Priority: Minor
>              Labels: flink-kubernetes-operator, metrics, 
> pull-request-available
>
> Operator-specific metrics were introduced as part of FLINK-26953. These 
> metrics are useful from a high-level reporting point of view (i.e. X many 
> FlinkDeployments are in state Y across the namespace), but they give no 
> insights as to the states/statuses of _individual_ (i.e. resource-level) 
> deployments. For example, there's currently no good signal to indicate if a 
> particular deployment is in a given lifecycle state.
> As part of our daily operational routine, we have found this lack of 
> resource-level metrics painful, since we cannot create graphs or alerts that 
> show the name of failing deployments. We can always turn to the metrics 
> emitted by Flink itself (ex: the {{<jobStatus>State}} Gauge metric available 
> on the JobManager) that are "faceted" by the job/deployment name, but in some 
> cases, a problem can occur before the jobs ever get to run and/or before 
> their metrics even get a chance to be emitted. There's also the fact that 
> not all status/states are covered by those metrics (i.e. lifecycle 
> states).
> Furthermore, the current set of metrics emitted for FlinkDeployments includes 
> namespace-level counts for each Job Manager state, but it does not include 
> count metrics for each Job status. Again, we can turn to metrics emitted 
> directly by Flink itself, but we run into the limitations I mentioned above.
> As such, we propose the following changes:
>  * Extending all of the existing "counter-based" metrics related to 
> status/state, so that each status/state also has a resource-level, 
> "gauge-based" metric that tracks whether each deployment (or the related 
> sub-resource, i.e. job/job manager) is in a given status/state
>  * Adding metrics to track the total count of Jobs in each status (by 
> namespace), and a gauge-based metric for each Job status (by deployment)
> Another way to present the suggested changes is to show what new items would 
> be added in the "Flink Resource Metrics" table shown on 
> [this|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#flink-resource-metrics]
>  page:
> ||Scope||Metrics||Description||Type||
> |Resource|FlinkDeployment.JmDeploymentStatus.<Status>.InStatus|For a given 
> Job Manager deployment status <Status>, return 1 if the Job Manager 
> associated with the FlinkDeployment is currently in that status, otherwise 
> return 0. <Status> can take values from: READY, DEPLOYED_NOT_READY, 
> DEPLOYING, MISSING, ERROR|Gauge|
> |Resource|FlinkDeployment.JobStatus.<Status>.InStatus|For a given job status 
> <Status>, return 1 if the job associated with the FlinkDeployment is 
> currently in that status, otherwise return 0. <Status> can take values from: 
> CANCELED, CANCELLING, CREATED, FAILED, FAILING, FINISHED, INITIALIZING, 
> RECONCILING, RESTARTING, RUNNING, SUSPENDED|Gauge|
> |Namespace|FlinkDeployment.JobStatus.<Status>.Count|Number of managed 
> FlinkDeployment resources per <Status> per namespace. <Status> can take 
> values from: CANCELED, CANCELLING, CREATED, FAILED, FAILING, FINISHED, 
> INITIALIZING, RECONCILING, RESTARTING, RUNNING, SUSPENDED|Gauge|
> |Resource|FlinkDeployment/FlinkSessionJob.Lifecycle.State.<State>.InState|For 
> a given lifecycle state <State>, return 1 if the managed resource is 
> currently in that state, otherwise return 0.  <State> can take values from: 
> CREATED, SUSPENDED, UPGRADING, DEPLOYED, STABLE, ROLLING_BACK, ROLLED_BACK, 
> FAILED|Gauge|
>  
> We've actually already implemented these changes in our fork of the 
> flink-kubernetes-operator codebase, and it's been working pretty well. At this 
> point, we're interested in merging the changes back into the main branch to 
> avoid diverging from the releases, share the improvement with the rest of the 
> community, and get some feedback on our implementation.
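
For readers skimming the table in the quoted description, the proposed gauge semantics (1 if a deployment is in a given status, otherwise 0, plus a per-status count) can be sketched roughly as below. This is only an illustration in plain Java and is not taken from the linked pull request: the class ResourceStatusGauges, its methods, and the deployment names are hypothetical, it tracks a single namespace for brevity, and it does not use the operator's or Flink's actual metric classes. Only the status values and the metric name pattern come from the table.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical sketch of the proposed semantics; not the operator's real metrics code. */
final class ResourceStatusGauges {

    /** Job statuses as listed in the proposal's table. */
    enum JobStatus {
        CANCELED, CANCELLING, CREATED, FAILED, FAILING, FINISHED,
        INITIALIZING, RECONCILING, RESTARTING, RUNNING, SUSPENDED
    }

    // Latest observed job status per FlinkDeployment name (one namespace, for brevity).
    private final Map<String, JobStatus> current = new ConcurrentHashMap<>();

    /** Records the status observed for a deployment, e.g. during reconciliation. */
    void update(String deployment, JobStatus status) {
        current.put(deployment, status);
    }

    /** Resource-level gauge: 1 if the deployment's job is in the given status, else 0. */
    Supplier<Integer> inStatus(String deployment, JobStatus status) {
        return () -> current.get(deployment) == status ? 1 : 0;
    }

    /** Namespace-level gauge: number of deployments whose job is in the given status. */
    Supplier<Long> count(JobStatus status) {
        return () -> current.values().stream().filter(s -> s == status).count();
    }

    /** Metric name following the pattern shown in the table. */
    static String inStatusName(JobStatus status) {
        return "FlinkDeployment.JobStatus." + status + ".InStatus";
    }

    public static void main(String[] args) {
        ResourceStatusGauges gauges = new ResourceStatusGauges();
        gauges.update("orders", JobStatus.RUNNING);   // hypothetical deployment names
        gauges.update("billing", JobStatus.FAILED);

        // Gauges are evaluated lazily, so they always reflect the latest update.
        Supplier<Integer> billingFailed = gauges.inStatus("billing", JobStatus.FAILED);
        System.out.println(inStatusName(JobStatus.FAILED) + " (billing) = " + billingFailed.get());
        System.out.println("FAILED count = " + gauges.count(JobStatus.FAILED).get());
    }
}
```

Because each gauge is tied to a deployment name, an alert on the FAILED gauge can report which deployment is failing, which is the gap the namespace-level counts leave open.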



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
