devmadhuu opened a new pull request, #7092:
URL: https://github.com/apache/ozone/pull/7092
## What changes were proposed in this pull request?
HDDS-11339. Publishing hadoop metrics immediately in Prometheus sink fills
up SinkQueue quickly.
Two issues:
`PrometheusServlet` is registered with `BaseHttpServer` when Prometheus
support is enabled. Prometheus calls it at every scrape interval (15 secs by
default), and on each call the servlet publishes the hadoop metrics
immediately. In a very busy cluster with a large number of metrics to
publish, the SinkQueue fills up quickly; the sink cannot consume the metrics
fast enough, and they are dropped outright.
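The queue-filling behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `BoundedQueueSketch` and its methods are hypothetical names, not Hadoop's internal sink queue, and the capacity and rates are made up for illustration. Each round the timer enqueues one snapshot, each extra scrape-triggered publish enqueues one more, and the sink consumes one.

```java
import java.util.ArrayDeque;

/** Toy model of a bounded sink queue (hypothetical; not Hadoop's actual class). */
final class BoundedQueueSketch {
    private final ArrayDeque<String> items = new ArrayDeque<>();
    private final int capacity;
    private int dropped;

    BoundedQueueSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Enqueues a snapshot, or counts it as dropped when the queue is full. */
    boolean offer(String snapshot) {
        if (items.size() >= capacity) {
            dropped++;
            return false;
        }
        items.addLast(snapshot);
        return true;
    }

    /** The sink consumes at most one snapshot per call. */
    void consumeOne() {
        items.pollFirst();
    }

    int dropped() {
        return dropped;
    }

    /**
     * Runs the given number of rounds. In each round the timer enqueues one
     * snapshot, extraPerRound additional "publish now" snapshots are enqueued,
     * and the sink then consumes a single snapshot.
     * Returns how many snapshots were dropped.
     */
    static int simulate(int capacity, int rounds, int extraPerRound) {
        BoundedQueueSketch queue = new BoundedQueueSketch(capacity);
        for (int r = 0; r < rounds; r++) {
            queue.offer("timer-" + r);
            for (int e = 0; e < extraPerRound; e++) {
                queue.offer("scrape-" + r + "-" + e);
            }
            queue.consumeOne();
        }
        return queue.dropped();
    }

    public static void main(String[] args) {
        System.out.println("timer only, dropped: " + simulate(10, 100, 0));
        System.out.println("timer + scrape, dropped: " + simulate(10, 100, 1));
    }
}
```

In this model the timer-only run never drops anything, while adding one extra publish per round outpaces the consumer and starts dropping as soon as the queue reaches capacity.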
Apart from the dropped metrics, the second issue is that publishing takes the
object lock of the `MetricsSystemImpl` class, and until the metrics are
actually published, other threads keep waiting to acquire that lock. A recent
incident highlighted this: in a busy cluster, ~190 threads were BLOCKED just
waiting to acquire the lock of the `MetricsSystemImpl` class. This made the
Recon role unresponsive, and after some time the JVM could not allocate
sufficient memory and crashed with an OOM. This Prometheus and OOM issue is
not specific to Recon; it can happen with any role that uses the Prometheus
service in a busy cluster.
`Solution:` We do not need to publish the metrics immediately by calling
`DefaultMetricsSystem.instance().publishMetricsNow();`
because the Prometheus sink already has a mechanism that publishes metrics
every 10 secs by default, using a callback on a timer event. So we need to
remove the above call that publishes immediately.
Below is the onTimerEvent callback:
`org.apache.hadoop.metrics2.impl.MetricsSystemImpl#onTimerEvent`
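The idea behind the fix can be sketched as follows: the periodic timer callback is the only publisher, and a scrape only reads whatever was published last. This is a simplified, stdlib-only illustration, not the actual Ozone or Hadoop code. `CachedMetricsSketch`, `scrape`, and `publishCount` are hypothetical names, and only `onTimerEvent` borrows its name from the callback mentioned above.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical stand-in for a sink whose content is refreshed only by a timer. */
final class CachedMetricsSketch {
    // Last published snapshot; scrapes read it without triggering a publish.
    private final AtomicReference<String> latest = new AtomicReference<>("");
    private int publishCount;

    /** Stand-in for the periodic timer callback: the only place that publishes. */
    synchronized void onTimerEvent(String snapshot) {
        publishCount++;
        latest.set(snapshot);
    }

    /** Stand-in for the scrape handler: returns the cached snapshot, lock-free. */
    String scrape() {
        return latest.get();
    }

    synchronized int publishCount() {
        return publishCount;
    }

    public static void main(String[] args) {
        CachedMetricsSketch sink = new CachedMetricsSketch();
        sink.onTimerEvent("requests_total 42");
        for (int i = 0; i < 5; i++) {
            sink.scrape(); // repeated scrapes do not publish again
        }
        System.out.println("publishes: " + sink.publishCount());
        System.out.println("last scrape: " + sink.scrape());
    }
}
```

With this shape, the number of publishes depends only on the timer period, so a frequent or slow scraper cannot add work to the queue or contend for the publishing lock.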
## What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-11339
## How was this patch tested?
This patch was tested manually by running a Docker cluster with Prometheus
support enabled and validating that the metrics are published to the
Prometheus instance.