[ https://issues.apache.org/jira/browse/FLINK-29939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17630875#comment-17630875 ]
Gyula Fora commented on FLINK-29939:
------------------------------------
Sounds good +1
> Add metrics for Kubernetes Client Response 5xx count and rate
> -------------------------------------------------------------
>
> Key: FLINK-29939
> URL: https://issues.apache.org/jira/browse/FLINK-29939
> Project: Flink
> Issue Type: Improvement
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.3.0
> Reporter: Zhou Jiang
> Priority: Minor
>
> The operator now publishes the k8s client response count by response code.
> In addition to the cumulative count, adding a rate for k8s client error
> responses could help set up alerts that proactively detect the status of the
> underlying cluster API server. This enhances the metrics available when the
> Flink Operator is deployed to shared / multi-tenant k8s clusters.
>
> Why is rate needed for certain response codes?
> To detect issues proactively by setting up alerts in certain cases. It may
> be the rate, not the total number, that indicates the start / end of an
> unavailability issue.
>
> Why do some 4xx matter in prod?
> For example, a noisy neighbor issue may occur at any time in shared
> clusters, and the operator may start to see an increased number of 429
> responses if the cluster does not enforce fairness in rate limiting. Another
> example is churn: when the cluster has namespace quotas defined and a
> namespace is under pod churn, the number of 409 responses could increase. In
> these cases, metrics and alerting on the count / rate of certain 4xx codes
> are critical to understanding the start / end of a prod outage.
>
> Why is 5xx needed?
> To identify infrastructure issues faster. A 5xx response count + rate is
> more straightforward than enumerating possible 5xx codes when setting up
> prod alerts.
>
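To make the count-versus-rate distinction in the description concrete, here is
a minimal sketch in plain Java. It is not the operator's implementation: the
class name ResponseCodeStats, its methods, and the sliding-window approach are
all assumptions made for illustration, and the sketch does not use any Flink
metrics API. It keeps a cumulative count per status code, an aggregated 5xx
count, and a 5xx rate over a trailing time window. Timestamps are passed in
explicitly so the behaviour is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the Flink operator. */
final class ResponseCodeStats {
    private final long windowMillis;
    // Cumulative count per exact status code (e.g. 200, 409, 429, 503).
    private final Map<Integer, Long> countByCode = new HashMap<>();
    // Timestamps of 5xx responses that may still fall inside the window.
    private final Deque<Long> recent5xx = new ArrayDeque<>();
    // Cumulative count of all 5xx responses, regardless of exact code.
    private long total5xx;

    ResponseCodeStats(long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("windowMillis must be positive");
        }
        this.windowMillis = windowMillis;
    }

    /** Records one response observed at nowMillis. */
    synchronized void record(int statusCode, long nowMillis) {
        countByCode.merge(statusCode, 1L, Long::sum);
        if (statusCode >= 500 && statusCode <= 599) {
            total5xx++;
            recent5xx.addLast(nowMillis);
        }
    }

    synchronized long count(int statusCode) {
        return countByCode.getOrDefault(statusCode, 0L);
    }

    synchronized long count5xx() {
        return total5xx;
    }

    /** 5xx responses per second over the trailing window ending at nowMillis. */
    synchronized double rate5xxPerSecond(long nowMillis) {
        // Drop timestamps that have aged out of the window.
        while (!recent5xx.isEmpty() && recent5xx.peekFirst() <= nowMillis - windowMillis) {
            recent5xx.removeFirst();
        }
        return recent5xx.size() * 1000.0 / windowMillis;
    }
}
```

In this sketch the cumulative 5xx count never decreases, while the rate falls
back to zero once the errors age out of the window, which is what lets an
alert mark both the start and the end of an unavailability period. The same
pattern could be applied to individual 4xx codes such as 429 or 409.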
--
This message was sent by Atlassian Jira
(v8.20.10#820010)