[
https://issues.apache.org/jira/browse/FLINK-36932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17912674#comment-17912674
]
Philippe Gref-Viau commented on FLINK-36932:
--------------------------------------------
Just bumping this up to see if I can get eyes on the issue and the proposed
implementation.
> Add resource-level metrics for different status/states to
> flink-kubernetes-operator
> -----------------------------------------------------------------------------------
>
> Key: FLINK-36932
> URL: https://issues.apache.org/jira/browse/FLINK-36932
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator, Runtime / Metrics
> Reporter: Philippe Gref-Viau
> Priority: Minor
> Labels: flink-kubernetes-operator, metrics,
> pull-request-available
>
> Operator-specific metrics were introduced as part of FLINK-26953. These
> metrics are useful from a high-level reporting point of view (i.e. X many
> FlinkDeployments are in state Y across the namespace), but they give no
> insights as to the states/statuses of _individual_ (i.e. resource-level)
> deployments. For example, there's currently no good signal to indicate if a
> particular deployment is in a given lifecycle state.
> As part of our daily operational routine, we have found this lack of
> resource-level metrics painful, since we cannot create graphs or alerts that
> show the name of failing deployments. We can always turn to the metrics
> emitted by Flink itself (e.g. the {{<jobStatus>State}} Gauge metric available
> on the JobManager) that are "faceted" by the job/deployment name, but in some
> cases, a problem can occur before the jobs ever get to run and/or before
> their metrics even get a chance to be emitted. There's also the fact that
> not all statuses/states are covered by those metrics (i.e. lifecycle
> states).
> Furthermore, the current set of metrics emitted for FlinkDeployments includes
> namespace-level counts for each Job Manager state, but it does not include
> count metrics for each Job status. Again, we can turn to metrics emitted
> directly by Flink itself, but we run into the limitations I mentioned above.
> As such, we propose the following changes:
> * Extending all of the existing "counter-based" metrics related to
> status/state, so that each status/state also has a resource-level,
> "gauge-based" metric that tracks whether each deployment (or the related
> sub-resource, i.e. job/job manager) is in a given status/state
> * Adding metrics to track the total count of Jobs in each status (by
> namespace), and a gauge-based metric for each Job status (by deployment)
> Another way to present the suggested changes is to show what new items would
> be added in the "Flink Resource Metrics" table shown on
> [this|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#flink-resource-metrics]
> page:
> ||Scope||Metrics||Description||Type||
> |Resource|FlinkDeployment.JmDeploymentStatus.<Status>.InStatus|For a given
> Job Manager deployment status <Status>, return 1 if the Job Manager
> associated with the FlinkDeployment is currently in that status, otherwise
> return 0. <Status> can take values from: READY, DEPLOYED_NOT_READY,
> DEPLOYING, MISSING, ERROR|Gauge|
> |Resource|FlinkDeployment.JobStatus.<Status>.InStatus|For a given job status
> <Status>, return 1 if the job associated with the FlinkDeployment is
> currently in that status, otherwise return 0. <Status> can take values from:
> CANCELED, CANCELLING, CREATED, FAILED, FAILING, FINISHED, INITIALIZING,
> RECONCILING, RESTARTING, RUNNING, SUSPENDED|Gauge|
> |Namespace|FlinkDeployment.JobStatus.<Status>.Count|Number of managed
> FlinkDeployment resources per <Status> per namespace. <Status> can take
> values from: CANCELED, CANCELLING, CREATED, FAILED, FAILING, FINISHED,
> INITIALIZING, RECONCILING, RESTARTING, RUNNING, SUSPENDED|Gauge|
> |Resource|FlinkDeployment/FlinkSessionJob.Lifecycle.State.<State>.InState|For
> a given lifecycle state <State>, return 1 if the managed resource is
> currently in that state, otherwise return 0. <State> can take values from:
> CREATED, SUSPENDED, UPGRADING, DEPLOYED, STABLE, ROLLING_BACK, ROLLED_BACK,
> FAILED|Gauge|
>
> We've actually already implemented these changes in our fork of the
> flink-kubernetes-operator codebase, and it's been working pretty well. At this
> point, we're interested in merging the changes back into the main branch to
> avoid diverging from the releases, share the improvement with the rest of the
> community, and get some feedback on our implementation.
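
For anyone reviewing, the 0/1 gauge semantics and the per-namespace counts
proposed in the table above can be sketched in plain Java. This is only a
minimal, dependency-free illustration, not the code from our fork and not the
operator's metric API: the class ResourceStatusMetrics and the methods
inStatusGauges and countsByStatus are hypothetical names, and a
java.util.function.Supplier stands in for a real metric gauge.

```java
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names exist in flink-kubernetes-operator.
final class ResourceStatusMetrics {

    // Job statuses as listed in the proposed table.
    enum JobStatus {
        CANCELED, CANCELLING, CREATED, FAILED, FAILING, FINISHED,
        INITIALIZING, RECONCILING, RESTARTING, RUNNING, SUSPENDED
    }

    // Resource scope: one gauge per status, named
    // "FlinkDeployment.JobStatus.<Status>.InStatus". Each gauge returns 1
    // while the deployment's job is in that status, otherwise 0.
    static Map<String, Supplier<Integer>> inStatusGauges(Supplier<JobStatus> current) {
        Map<String, Supplier<Integer>> gauges = new LinkedHashMap<>();
        for (JobStatus status : JobStatus.values()) {
            gauges.put(
                    "FlinkDeployment.JobStatus." + status + ".InStatus",
                    () -> current.get() == status ? 1 : 0);
        }
        return gauges;
    }

    // Namespace scope: number of deployments currently in each status.
    // The input maps a deployment name to its current job status.
    static Map<JobStatus, Integer> countsByStatus(Map<String, JobStatus> statusByDeployment) {
        Map<JobStatus, Integer> counts = new EnumMap<>(JobStatus.class);
        for (JobStatus status : JobStatus.values()) {
            counts.put(status, 0); // report 0 for statuses with no deployments
        }
        for (JobStatus status : statusByDeployment.values()) {
            counts.merge(status, 1, Integer::sum);
        }
        return counts;
    }
}
```

Because each gauge reads the current status when it is sampled, a deployment
that moves from RUNNING to FAILED flips the two corresponding gauges without
re-registering anything, and an alert can be keyed on the deployment name
rather than on a namespace-wide count.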
--
This message was sent by Atlassian Jira
(v8.20.10#820010)