[ 
https://issues.apache.org/jira/browse/FLINK-29939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gyula Fora updated FLINK-29939:
-------------------------------
    Affects Version/s:     (was: kubernetes-operator-1.3.0)

> Add metrics for Kubernetes Client Response 5xx count and rate
> -------------------------------------------------------------
>
>                 Key: FLINK-29939
>                 URL: https://issues.apache.org/jira/browse/FLINK-29939
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Zhou Jiang
>            Assignee: Zhou Jiang
>            Priority: Minor
>              Labels: pull-request-available
>
> The operator now publishes the k8s client response count by response code. 
> In addition to the cumulative count, adding a rate for k8s client error 
> responses would help set up alerts that proactively detect problems with the 
> underlying cluster API server. This enhances the metrics available when the 
> Flink Operator is deployed to shared / multi-tenant k8s clusters. 
>  
> Why is rate needed for certain response codes?
> To detect issues proactively by setting up alerts. In some cases it is not 
> the total count but the rate that indicates the start / end of an 
> unavailability issue.
>  
> Why do some 4xx matter in prod?
> For example, a noisy-neighbor issue may occur at any time in shared 
> clusters, and the operator may start to see an increased number of 429 
> responses if the cluster does not enforce fairness in rate limiting. Another 
> example is churn: when the cluster has namespace quotas defined and a 
> namespace is under pod churn, the number of 409 responses may increase. In 
> these cases, metrics and alerting on the count / rate of certain 4xx codes 
> are critical for understanding the start / end of a prod outage.
>  
> Why is 5xx needed?
> To identify infrastructure issues faster. With a 5xx response count and rate, 
> setting up prod alerts is more straightforward than enumerating every 
> possible 5xx code.
>  
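
To make the count-versus-rate distinction above concrete, here is a minimal, language-neutral sketch in Python. It is not the operator's implementation (the operator is written in Java and uses its own metrics system); the class name `ResponseMetrics`, its methods, and the 60-second window are all hypothetical and chosen only for illustration. It keeps cumulative counts per status code and per status class, and derives a trailing-window rate for 5xx responses.

```python
from collections import Counter, deque


class ResponseMetrics:
    """Hypothetical tracker: cumulative counts plus a windowed 5xx rate."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.count_by_code = Counter()    # cumulative, keyed by code (e.g. 429)
        self.count_by_class = Counter()   # cumulative, keyed by class (e.g. "5xx")
        self._recent_5xx = deque()        # timestamps of 5xx responses in the window

    def record(self, status_code, now):
        """Record one response observed at time `now` (seconds)."""
        self.count_by_code[status_code] += 1
        status_class = f"{status_code // 100}xx"
        self.count_by_class[status_class] += 1
        if status_class == "5xx":
            self._recent_5xx.append(now)

    def rate_5xx(self, now):
        """Return 5xx responses per second over the trailing window."""
        cutoff = now - self.window_seconds
        # Drop timestamps that have fallen out of the window.
        while self._recent_5xx and self._recent_5xx[0] <= cutoff:
            self._recent_5xx.popleft()
        return len(self._recent_5xx) / self.window_seconds
```

In this sketch the cumulative counters only ever grow, while `rate_5xx` falls back to zero once the errors stop, which is why an alert on the rate can mark both the start and the end of an incident. Grouping by status class also means an alert can watch a single "5xx" series instead of listing each code.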



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
