Turns out it was excessive disk reads across all nodes. There were too many container start errors.
Thanks Derek for the tip and Solly for your time. I guess logs won't be necessary anymore.

Regards,
--
Mateus Caruccio / Master of Puppets
GetupCloud.com
We make the infrastructure invisible

2017-03-23 15:15 GMT+00:00 Solly Ross <[email protected]>:

> Also, would it be possible to see the full set of Heapster logs?
> Sometimes there can be useful signals in there as to what's going
> on that aren't immediately obvious.
>
> Best Regards,
> Solly Ross
>
> ----- Original Message -----
> > From: "Derek Carr" <[email protected]>
> > To: "Mateus Caruccio" <[email protected]>
> > Cc: "Solly Ross" <[email protected]>, [email protected]
> > Sent: Wednesday, March 22, 2017 11:27:11 PM
> > Subject: Re: Heapster failing for some pods
> >
> > Are you seeing high iops on impacted nodes?
> >
> > If so, it could be related to the following:
> > https://github.com/openshift/origin/pull/12822
> >
> > If so, you can try to remove thin_ls from your host so it will not be used
> > to do per container devicemapper usage stats in cAdvisor which has been
> > shown to cause issues similar to this.
> >
> > Thanks,
> >
> > On Wed, Mar 22, 2017 at 9:42 PM Mateus Caruccio <
> > [email protected]> wrote:
> >
> > > At
> > > https://paste.fedoraproject.org/paste/FYFahXSMMQOVUWHkXcrer15M1UNdIGYhyRLivL9gydE=
> > > you can find a log grep from heapster with --sink=log set.
> > >
> > > Looking for pod "portal-107-rg2ia" one can see it's not being sinked every
> > > scraping period (only 3/9 during this snippet).
> > >
> > > --
> > > Mateus Caruccio / Master of Puppets
> > > GetupCloud.com
> > > We make the infrastructure invisible
> > >
> > > 2017-03-22 19:43 GMT-03:00 Derek Carr <[email protected]>:
> > >
> > > +Solly
> > >
> > > Anything you can assist with here?
> > >
> > > Thanks,
> > >
> > > On Wed, Mar 22, 2017 at 6:27 PM Mateus Caruccio <
> > > [email protected]> wrote:
> > >
> > > Hi.
> > >
> > > Heapster is experiencing failures for some pods of the cluster, which in
> > > turn causes HPA to malfunction.
> > >
> > > From project events I can see:
> > >
> > > 2017-03-22T22:13:29Z 2017-03-22T21:32:59Z 32 portal HorizontalPodAutoscaler Warning FailedGetMetrics {horizontal-pod-autoscaler } failed to get CPU consumption and request: metrics obtained for 2/4 of pods
> > > 2017-03-22T22:13:29Z 2017-03-22T21:32:59Z 32 portal HorizontalPodAutoscaler Warning FailedComputeReplicas {horizontal-pod-autoscaler } failed to get CPU utilization: failed to get CPU consumption and request: metrics obtained for 2/4 of pods
> > >
> > > Heapster logs say some pods have no metrics, while other pods from the
> > > same project do:
> > >
> > > I0322 22:10:29.104727 1 handlers.go:242] No metrics for container wordpress in pod kondzilla/portal-107-rg2ia
> > > I0322 22:10:29.104746 1 handlers.go:178] No metrics for pod kondzilla/portal-107-rg2ia
> > > ...
> > > I0322 22:12:21.780763 1 pod_based_enricher.go:141] Container namespace:kondzilla/pod:portal-107-rg2ia/container:wordpress not found, creating a stub
> > >
> > > Hitting kubelet's /stats/container/ does return valid stats, as
> > > expected.
> > >
> > > I'm running:
> > >
> > > openshift v1.3.1
> > > kubernetes v1.3.0+52492b4
> > > etcd 2.3.0+git
> > >
> > > openshift/origin-metrics-cassandra:v1.3.1
> > > openshift/origin-metrics-hawkular-metrics:v1.3.1
> > > openshift/origin-metrics-heapster:v1.3.2 (v1.3.1 has the same effect)
> > >
> > > Thanks,
> > >
> > > --
> > > Mateus Caruccio / Master of Puppets
> > > GetupCloud.com
> > > We make the infrastructure invisible
> > > _______________________________________________
> > > dev mailing list
> > > [email protected]
> > > http://lists.openshift.redhat.com/openshiftmm/listinfo/dev
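
For anyone hitting the same symptom, the "No metrics for pod" lines quoted above can be tallied per pod to see which pods Heapster misses most often. This is only a rough sketch: it assumes the log line format shown in this thread, and the names `NO_METRICS_RE` and `count_missing` are made up for illustration, not part of Heapster.

```python
import re
import sys
from collections import Counter

# Assumed format, taken from the lines quoted in this thread:
#   I0322 22:10:29.104746 1 handlers.go:178] No metrics for pod kondzilla/portal-107-rg2ia
# Per-container lines ("No metrics for container ... in pod ...") are
# deliberately not matched, so each pod-level miss is counted once.
NO_METRICS_RE = re.compile(r"No metrics for pod (\S+)")


def count_missing(log_lines):
    """Return a Counter mapping 'namespace/pod' to its number of pod-level misses."""
    misses = Counter()
    for line in log_lines:
        match = NO_METRICS_RE.search(line)
        if match:
            misses[match.group(1)] += 1
    return misses


if __name__ == "__main__":
    # Reads Heapster log text from stdin and prints the worst offenders first.
    for pod, n in count_missing(sys.stdin).most_common():
        print(f"{n:6d}  {pod}")
```

Saved under a name of your choosing (say `count_missing.py`, hypothetical), it could be fed the Heapster log on stdin, e.g. `oc logs <heapster-pod> | python count_missing.py`. If the misses are spread across pods on every node rather than concentrated on one, that would be consistent with a node-level cause such as the disk reads mentioned at the top.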
_______________________________________________
dev mailing list
[email protected]
http://lists.openshift.redhat.com/openshiftmm/listinfo/dev
