Devesh Kumar Singh created HDDS-11339:
-----------------------------------------
Summary: Publishing hadoop metrics immediately in Prometheus sink
fills up SinkQueue quickly
Key: HDDS-11339
URL: https://issues.apache.org/jira/browse/HDDS-11339
Project: Apache Ozone
Issue Type: Task
Components: Ozone Manager, Ozone Recon
Affects Versions: 1.4.0
Reporter: Devesh Kumar Singh
Assignee: Devesh Kumar Singh
Two issues:
# PrometheusServlet is registered with BaseHttpServer when Prometheus
support is enabled. It is called on every scrape (every 15 seconds by default)
and publishes the Hadoop metrics immediately on each call. On a very busy
cluster with a large number of metrics to publish, this fills up the SinkQueue
quickly; the sink then cannot consume the queued metrics fast enough, and
further metrics are dropped outright.
# Apart from the dropped metrics, publishing takes the object lock of the
MetricsSystemImpl class, and until the metrics have actually been published,
other threads wait to acquire that lock. A recent incident highlighted this:
on a busy cluster, ~190 threads were BLOCKED just waiting to acquire the lock
of the MetricsSystemImpl class. This made the Recon role unresponsive, and
after some time the JVM could not allocate sufficient memory and crashed with
an OOM. The OOM is not specific to Recon; it can happen with any role that
uses the Prometheus service on a busy cluster.
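The first issue can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: SinkQueueModel and its methods are hypothetical names, and the class is a simplified stand-in for a bounded sink queue, not Hadoop's actual SinkQueue. It shows that once snapshots are enqueued faster than the sink drains them, the queue stays full and every extra snapshot is dropped.

```java
import java.util.concurrent.ArrayBlockingQueue;

/** Simplified, hypothetical model of a bounded sink queue; not Hadoop's SinkQueue. */
final class SinkQueueModel {
    private final ArrayBlockingQueue<String> queue;
    private int dropped = 0;

    SinkQueueModel(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Producer side: enqueue a snapshot, or count it as dropped when the queue is full. */
    boolean put(String snapshot) {
        boolean accepted = queue.offer(snapshot); // non-blocking; returns false when full
        if (!accepted) {
            dropped++;
        }
        return accepted;
    }

    /** Consumer side: the sink takes one snapshot, or null if the queue is empty. */
    String consumeOne() {
        return queue.poll();
    }

    int dropped() {
        return dropped;
    }

    /**
     * Runs the given number of rounds. In each round the producer enqueues
     * putsPerTick snapshots and the sink then consumes consumesPerTick of them.
     * Returns how many snapshots were dropped in total.
     */
    static int simulate(int capacity, int ticks, int putsPerTick, int consumesPerTick) {
        SinkQueueModel q = new SinkQueueModel(capacity);
        for (int t = 0; t < ticks; t++) {
            for (int i = 0; i < putsPerTick; i++) {
                q.put("snapshot-" + t + "-" + i);
            }
            for (int i = 0; i < consumesPerTick; i++) {
                q.consumeOne();
            }
        }
        return q.dropped();
    }
}
```

With a capacity of 10, SinkQueueModel.simulate(10, 100, 1, 1) drops nothing, while doubling the enqueue rate with SinkQueueModel.simulate(10, 100, 2, 1) starts dropping one snapshot per round as soon as the queue is full.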
Solution: We do not need to publish the metrics immediately by calling
{code:java}
DefaultMetricsSystem.instance().publishMetricsNow();
{code}
because the Prometheus sink already has a mechanism to publish metrics every 10
seconds by default, using a callback on a timer event. So we need to remove the
above call that publishes immediately.
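A minimal sketch of the effect of this change, assuming a simplified model: FakeMetricsSystem, scrapeWithImmediatePublish and scrapeFromSinkOnly are hypothetical names used only for illustration and are not Ozone or Hadoop APIs. The synchronized publish() mimics a publish that holds the object lock while it runs; the model counts how often that lock-taking publish is triggered with and without the per-scrape call.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical model; none of these names are Ozone or Hadoop APIs. */
final class ScrapeModel {

    /** Stand-in for the metrics system: counts publishes and holds the sink's latest snapshot. */
    static final class FakeMetricsSystem {
        private final AtomicInteger publishCalls = new AtomicInteger();
        private volatile String sinkSnapshot = "";

        /** Synchronized to mimic a publish that holds the object lock while it runs. */
        synchronized void publish() {
            sinkSnapshot = "snapshot-" + publishCalls.incrementAndGet();
        }

        String latestSnapshot() {
            return sinkSnapshot;
        }

        int publishCalls() {
            return publishCalls.get();
        }
    }

    /** Scrape handler that forces a publish before answering (the behaviour to be removed). */
    static String scrapeWithImmediatePublish(FakeMetricsSystem ms) {
        ms.publish();
        return ms.latestSnapshot();
    }

    /** Scrape handler that only serves what the periodic publish last delivered to the sink. */
    static String scrapeFromSinkOnly(FakeMetricsSystem ms) {
        return ms.latestSnapshot();
    }

    /** Counts publishes for a run with the given number of timer ticks and scrapes. */
    static int countPublishes(int timerTicks, int scrapes, boolean publishOnScrape) {
        FakeMetricsSystem ms = new FakeMetricsSystem();
        for (int i = 0; i < timerTicks; i++) {
            ms.publish(); // periodic timer callback
        }
        for (int i = 0; i < scrapes; i++) {
            if (publishOnScrape) {
                scrapeWithImmediatePublish(ms);
            } else {
                scrapeFromSinkOnly(ms);
            }
        }
        return ms.publishCalls();
    }
}
```

In this model, every scrape that publishes immediately adds one more lock-taking publish on top of the periodic ones, whereas the sink-only handler leaves the publish count equal to the number of timer ticks and still returns the most recent snapshot.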
--
This message was sent by Atlassian Jira
(v8.20.10#820010)