[prometheus-users] Re: Best practice find slow query

Kaiming Wan Fri, 06 Mar 2020 20:30:07 -0800

Hello, how can i get the grfana dashboard like you to monitor prometheus's 
inner metric itself. Can the dashboard config be shared?


On Saturday, May 18, 2019 at 1:32:14 AM UTC+8, [email protected] wrote:
>
> Hello Prometheus experts,
>
> We are starting to use Prometheus since last October and it's awesome and 
> stable, thanks for creating this!
> Since our software released to production this year, it's working perfect 
> until recently.
> The query time and CPU/Memory consumption increased but we didn't 
> introduce new serious(other than some pod rotation), till now we can't find 
> the query cause the problem.
>
> The troubleshooting we have been done include the following:
> 1. We are using Prometheus-operator to deploy Prometheus, the latest 
> version is compatible with v2.7.2, so we upgrade from v2.5.0 to v2.7.2.
> The problem seems gone and the CPU/Memory back to normal for like a week 
> then the problem back again.
>
> 2. Checked some Prometheus internal metrics, attached some key metrics 
> below:
> It did show the Mem/CPU increased with the query time
> [image: Capture.PNG][image: Capture.PNG][image: Capture.PNG]
>
>
> [image: Capture.PNG]
>
>
>
>
>
> 3. Open debug logs, attached some logging sample below.
> Not identify key information from it, saw from a github issue that 
> Prometheus not expose slow query logs yet.
>
> level=debug ts=2019-05-17T16:54:42.168017996Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof 
> msg="/app/discovery/kubernetes/kubernetes.go:301: Watch close - *v1.Service 
> total 0 items received"
> level=debug ts=2019-05-17T16:54:42.170121992Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/monitoring/services?resourceVersion=24797369&timeout=8m40s&timeoutSeconds=520&watch=true
>  
> 200 OK in 1 milliseconds"
> level=debug ts=2019-05-17T16:54:52.269476837Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof 
> msg="/app/discovery/kubernetes/kubernetes.go:302: Watch close - *v1.Pod 
> total 0 items received"
> level=debug ts=2019-05-17T16:54:52.32996453Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/monitoring/pods?resourceVersion=25142787&timeout=6m0s&timeoutSeconds=360&watch=true
>  
> 200 OK in 59 milliseconds"
> level=debug ts=2019-05-17T16:54:52.615343281Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof 
> msg="/app/discovery/kubernetes/kubernetes.go:301: Watch close - *v1.Service 
> total 0 items received"
> level=debug ts=2019-05-17T16:54:52.635172368Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/monitoring/services?resourceVersion=24797369&timeout=9m25s&timeoutSeconds=565&watch=true
>  
> 200 OK in 19 milliseconds"
> level=debug ts=2019-05-17T16:54:53.493438475Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof 
> msg="/app/discovery/kubernetes/kubernetes.go:302: Watch close - *v1.Pod 
> total 0 items received"
> level=debug ts=2019-05-17T16:54:53.495189767Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/kube-system/pods?resourceVersion=25142787&timeout=6m50s&timeoutSeconds=410&watch=true
>  
> 200 OK in 1 milliseconds"
> level=debug ts=2019-05-17T16:54:58.626413142Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof 
> msg="/app/discovery/kubernetes/kubernetes.go:302: Watch close - *v1.Pod 
> total 0 items received"
> level=debug ts=2019-05-17T16:54:58.628175101Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/default/pods?resourceVersion=25242284&timeout=7m17s&timeoutSeconds=437&watch=true
>  
> 200 OK in 1 milliseconds"
> level=error ts=2019-05-17T16:55:31.949151341Z caller=api.go:1099 
> component=web msg="error writing response" bytesWritten=0 err="write tcp 
> 10.0.77.181:9090->10.0.98.133:50764: write: broken pipe"
> level=error ts=2019-05-17T16:55:31.977217056Z caller=api.go:1099 
> component=web msg="error writing response" bytesWritten=0 err="write tcp 
> 10.0.77.181:9090->10.0.98.133:50758: write: broken pipe"
> level=error ts=2019-05-17T16:55:31.991923795Z caller=api.go:1099 
> component=web msg="error writing response" bytesWritten=0 err="write tcp 
> 10.0.77.181:9090->10.0.98.133:50786: write: broken pipe"
> level=debug ts=2019-05-17T16:55:54.827951493Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof 
> msg="/app/discovery/kubernetes/kubernetes.go:300: Watch close - 
> *v1.Endpoints total 0 items received"
> level=debug ts=2019-05-17T16:55:54.8373099Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/ingress-nginx/endpoints?resourceVersion=25278160&timeout=7m44s&timeoutSeconds=464&watch=true
>  
> 200 OK in 9 milliseconds"
> level=warn ts=2019-05-17T16:55:54.837493003Z caller=klog.go:86 
> component=k8s_client_runtime func=Warningf 
> msg="/app/discovery/kubernetes/kubernetes.go:300: watch of *v1.Endpoints 
> ended with: too old resource version: 25278160 (25278726)"
> level=debug ts=2019-05-17T16:55:55.837642377Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof msg="Listing and watching 
> *v1.Endpoints from /app/discovery/kubernetes/kubernetes.go:300"
> level=debug ts=2019-05-17T16:55:55.839818229Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/ingress-nginx/endpoints?limit=500&resourceVersion=0
>  
> 200 OK in 2 milliseconds"
> level=debug ts=2019-05-17T16:55:55.841480223Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/ingress-nginx/endpoints?resourceVersion=25280464&timeout=7m54s&timeoutSeconds=474&watch=true
>  
> 200 OK in 1 milliseconds"
> level=debug ts=2019-05-17T16:56:39.672732588Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof 
> msg="/app/discovery/kubernetes/kubernetes.go:300: forcing resync"
> level=debug ts=2019-05-17T16:57:00.686226407Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof 
> msg="/app/discovery/kubernetes/kubernetes.go:300: Watch close - 
> *v1.Endpoints total 0 items received"
> level=debug ts=2019-05-17T16:57:00.687913369Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/monitoring/endpoints?resourceVersion=25278468&timeout=5m55s&timeoutSeconds=355&watch=true
>  
> 200 OK in 1 milliseconds"
> level=warn ts=2019-05-17T16:57:00.688149973Z caller=klog.go:86 
> component=k8s_client_runtime func=Warningf 
> msg="/app/discovery/kubernetes/kubernetes.go:300: watch of *v1.Endpoints 
> ended with: too old resource version: 25278468 (25278836)"
> level=debug ts=2019-05-17T16:57:01.688301933Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof msg="Listing and watching 
> *v1.Endpoints from /app/discovery/kubernetes/kubernetes.go:300"
> level=debug ts=2019-05-17T16:57:01.690457942Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/monitoring/endpoints?limit=500&resourceVersion=0
>  
> 200 OK in 2 milliseconds"
> level=debug ts=2019-05-17T16:57:01.692469494Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/monitoring/endpoints?resourceVersion=25280581&timeout=6m28s&timeoutSeconds=388&watch=true
>  
> 200 OK in 1 milliseconds"
> level=debug ts=2019-05-17T16:57:35.017591915Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof 
> msg="/app/discovery/kubernetes/kubernetes.go:300: Watch close - 
> *v1.Endpoints total 306 items received"
> level=debug ts=2019-05-17T16:57:35.027129763Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/kube-system/endpoints?resourceVersion=25280638&timeout=7m39s&timeoutSeconds=459&watch=true
>  
> 200 OK in 9 milliseconds"
> level=debug ts=2019-05-17T16:58:01.203323354Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof 
> msg="/app/discovery/kubernetes/kubernetes.go:301: Watch close - *v1.Service 
> total 0 items received"
> level=debug ts=2019-05-17T16:58:01.213971777Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/ingress-nginx/services?resourceVersion=24797369&timeout=9m44s&timeoutSeconds=584&watch=true
>  
> 200 OK in 10 milliseconds"
> level=debug ts=2019-05-17T16:59:21.431595235Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof 
> msg="/app/discovery/kubernetes/kubernetes.go:302: Watch close - *v1.Pod 
> total 0 items received"
> level=debug ts=2019-05-17T16:59:21.441100798Z caller=klog.go:70 
> component=k8s_client_runtime func=Infof msg="GET 
> https://172.20.0.1:443/api/v1/namespaces/monitoring/pods?resourceVersion=25142787&timeout=8m25s&timeoutSeconds=505&watch=true
>  
> 200 OK in 9 milliseconds"
> level=error ts=2019-05-17T16:59:49.77703169Z caller=api.go:1099 
> component=web msg="error writing response" bytesWritten=0 err="write tcp 
> 10.0.77.181:9090->10.0.98.133:51904: write: broken pipe"
> level=info ts=2019-05-17T17:00:03.594506257Z caller=compact.go:443 
> component=tsdb msg="write block" mint=1558101600000 maxt=1558108800000 
> ulid=01DB3BRWSDRAZE0CA29N1HXFTC
> level=info ts=2019-05-17T17:00:04.49353602Z caller=head.go:526 
> component=tsdb msg="head GC completed" duration=211.926469ms
> level=info ts=2019-05-17T17:00:08.512417508Z caller=head.go:573 
> component=tsdb msg="WAL checkpoint complete" first=5642 last=5650 
> duration=4.018819936s
> level=debug ts=2019-05-17T17:00:23.151252507Z caller=klog.go:53 
> component=k8s_client_runtime func=Verbose.Infof 
> msg="/app/discovery/kubernetes/kubernetes.go:300: Watch close - 
> *v1.Endpoints total 0 items received"
>
> 4. Disabled all alerts and make sure no one access the dashboard, the heap 
> size drops.(Dumb debugging way)
> [image: Capture.PNG]
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> [image: Capture.PNG]
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Feel little lost about how to further troubleshooting this issue, a dumb 
> way is disable this chart one by one and see how it goes...Any advice or 
> guidence is appriciated, thanks in advance!
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/de973db4-b725-4e04-9734-d3a8589e738d%40googlegroups.com.

[prometheus-users] Re: Best practice find slow query

Reply via email to