Damon Cortesi created SPARK-51613:
-------------------------------------
Summary: Improve Spark Operator metrics
Key: SPARK-51613
URL: https://issues.apache.org/jira/browse/SPARK-51613
Project: Spark
Issue Type: Improvement
Components: Kubernetes
Affects Versions: kubernetes-operator-0.1.0
Reporter: Damon Cortesi
Today the Spark Operator provides JVM, Kubernetes, and Java Operator SDK
metrics, but no metrics specific to the functionality and health of the Spark
App or Cluster resources managed by the operator. It would be nice to have
metrics like:
* Cumulative counts of Apps or Clusters by state (Submitted, Failed, Successful,
etc.)
* Gauges of the current number of Apps or Clusters in each state (Submitted,
Pending, Running, etc.)
* Timers for Spark submit latency (for example, Submission --> Running)
* Potentially the depth of the reconciliation backlog and how many apps are
getting added per interval, although this may already be covered by the
operator SDK metrics via reconciliations_queue_size
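
To make the first three bullets concrete, here is a minimal sketch in plain Java of how per-state counters, gauges, and a submit-latency timer could be derived from state transitions. It is only an illustration: the class name AppStateMetrics, the State enum, and the onTransition/onDeleted hooks are hypothetical and are not part of the operator's existing code, and the list of states is an assumed subset.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: names and states are illustrative, not the operator's API.
final class AppStateMetrics {
    // Assumed subset of states; the operator's real state enum may differ.
    enum State { SUBMITTED, PENDING, RUNNING, SUCCEEDED, FAILED }

    private final long[] totals = new long[State.values().length];  // cumulative counters
    private final long[] current = new long[State.values().length]; // gauges
    private final Map<String, State> stateByApp = new HashMap<>();
    private final Map<String, Long> submittedAtNanos = new HashMap<>();
    private long latencyCount;
    private long latencyTotalNanos;

    /** Record that an app entered newState at the given monotonic timestamp. */
    synchronized void onTransition(String appId, State newState, long nowNanos) {
        State previous = stateByApp.put(appId, newState);
        if (previous == newState) {
            return; // repeated reconcile without a state change
        }
        if (previous != null) {
            current[previous.ordinal()]--; // app left its old state
        }
        current[newState.ordinal()]++;
        totals[newState.ordinal()]++;
        if (newState == State.SUBMITTED) {
            submittedAtNanos.put(appId, nowNanos);
        } else if (newState == State.RUNNING) {
            Long start = submittedAtNanos.remove(appId);
            if (start != null) { // Submission --> Running latency
                latencyCount++;
                latencyTotalNanos += nowNanos - start;
            }
        }
    }

    /** Drop a deleted app so the gauges do not count it forever. */
    synchronized void onDeleted(String appId) {
        State previous = stateByApp.remove(appId);
        if (previous != null) {
            current[previous.ordinal()]--;
        }
        submittedAtNanos.remove(appId);
    }

    synchronized long total(State s) { return totals[s.ordinal()]; }

    synchronized long gauge(State s) { return current[s.ordinal()]; }

    synchronized long submitLatencyCount() { return latencyCount; }

    synchronized double meanSubmitLatencyMillis() {
        return latencyCount == 0 ? 0.0 : latencyTotalNanos / (latencyCount * 1e6);
    }
}
```

In a real implementation these values would presumably be registered with the operator's existing metrics registry rather than kept in a standalone class, and a histogram would be more useful than a running mean for latency. The sketch only shows which events would need to be observed.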
In addition, it would be nice to have Prometheus metrics with labels, but it
doesn't look like Dropwizard supports that, nor does it seem likely to, judging
by [https://github.com/dropwizard/metrics/issues/1272].
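
To illustrate the difference, the sketch below contrasts a flat, dot-separated metric name, where the state has to be folded into the name, with a line in the Prometheus text exposition format, where the state is a label on a single metric. The class MetricNaming, its methods, and the metric names used are hypothetical and only serve the comparison.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
final class MetricNaming {
    // Flat name: one metric per state, with the state encoded in the name itself.
    static String flatName(String base, String state) {
        return base + "." + state.toLowerCase(Locale.ROOT);
    }

    // Prometheus text format: one metric name, with the state carried as a label.
    static String prometheusLine(String name, Map<String, String> labels, long value) {
        if (labels.isEmpty()) {
            return name + " " + value;
        }
        // Sort labels by key so the output is deterministic.
        String rendered = new TreeMap<>(labels).entrySet().stream()
            .map(e -> e.getKey() + "=\"" + escape(e.getValue()) + "\"")
            .collect(Collectors.joining(","));
        return name + "{" + rendered + "} " + value;
    }

    // Label values escape backslash, double quote, and line feed.
    private static String escape(String v) {
        return v.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}
```

With labels, a query can aggregate or filter across states on one metric, whereas flat names require knowing each per-state metric name in advance.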