[jira] [Created] (HDDS-11339) Publishing hadoop metrics immediately in Prometheus sink fills up SinkQueue quickly

Devesh Kumar Singh (Jira) Mon, 19 Aug 2024 03:22:04 -0700

Devesh Kumar Singh created HDDS-11339:
-----------------------------------------


             Summary: Publishing hadoop metrics immediately in Prometheus sink 
fills up SinkQueue quickly
                 Key: HDDS-11339
                 URL: https://issues.apache.org/jira/browse/HDDS-11339
             Project: Apache Ozone
          Issue Type: Task
          Components: Ozone Manager, Ozone Recon
    Affects Versions: 1.4.0
            Reporter: Devesh Kumar Singh
            Assignee: Devesh Kumar Singh


Two issues:
 # PrometheusServlet is registered with BaseHttpServer when Prometheus 
support is enabled. It is called on every scrape (every 15 secs by default) 
and publishes the Hadoop metrics immediately each time. On a very busy cluster 
with a large number of metrics to publish, the SinkQueue fills up quickly; the 
sink then cannot consume the queued metrics, and they are dropped outright.
 # Apart from the dropped metrics, publishing takes the object lock of the 
MetricsSystemImpl class, and other threads keep waiting for that lock until 
the metrics have actually been published. In a recent incident on a busy 
cluster, ~190 threads were BLOCKED just waiting to acquire the lock of the 
MetricsSystemImpl class. This made the Recon role unresponsive, and after some 
time the JVM could not allocate sufficient memory and crashed with an OOM. The 
OOM is not specific to Recon: it can happen with any role that uses the 
Prometheus service in a busy cluster.
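
The drop behaviour described in the first issue can be illustrated with a small bounded queue. This is only a sketch under assumptions: BoundedDropQueue, its capacity, and the snapshot names are hypothetical and are not taken from Hadoop's SinkQueue. It shows what happens when a producer enqueues faster than the consumer drains: once the queue is full, further items are rejected and counted as dropped.

```java
import java.util.ArrayDeque;

/** Hypothetical bounded queue for illustration; NOT Hadoop's SinkQueue. */
public class BoundedDropQueue<T> {
    private final ArrayDeque<T> items = new ArrayDeque<>();
    private final int capacity;
    private int dropped;

    public BoundedDropQueue(int capacity) {
        this.capacity = capacity;
    }

    /** Adds an item, or returns false and counts a drop if the queue is full. */
    public synchronized boolean enqueue(T item) {
        if (items.size() >= capacity) {
            dropped++;
            return false;
        }
        items.addLast(item);
        return true;
    }

    /** Removes and returns the oldest item, or null if the queue is empty. */
    public synchronized T dequeue() {
        return items.pollFirst();
    }

    public synchronized int size() {
        return items.size();
    }

    public synchronized int dropped() {
        return dropped;
    }

    public static void main(String[] args) {
        // Capacity 2 is an arbitrary value chosen for the example.
        BoundedDropQueue<String> queue = new BoundedDropQueue<>(2);
        // Five publishes arrive before the consumer drains anything.
        for (int i = 0; i < 5; i++) {
            queue.enqueue("snapshot-" + i);
        }
        System.out.println("queued=" + queue.size() + " dropped=" + queue.dropped());
    }
}
```

With a capacity of 2 and five publishes in a row, two snapshots stay queued and three are dropped, which is the pattern a scrape-triggered publish can produce on a busy cluster.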

 

Solution: We do not need to publish the metrics immediately by calling 

 
{code:java}
DefaultMetricsSystem.instance().publishMetricsNow();
{code}
because the Prometheus sink already has a mechanism to publish metrics every 10 
secs by default, using a callback driven by a timer event. So we need to remove 
the above call, which publishes immediately.
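
The proposed behaviour can be sketched as follows. This is an illustration under assumptions, not the Ozone or Hadoop code: CachedMetricsEndpoint, refresh(), scrape() and start() are hypothetical names. The point is that only the periodic timer publishes, while a scrape request just returns the most recently published snapshot and never triggers a publish itself.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical endpoint for illustration; NOT the actual PrometheusServlet. */
public class CachedMetricsEndpoint {
    private final AtomicReference<String> latest = new AtomicReference<>("");
    private final AtomicInteger publishes = new AtomicInteger();

    /** Stands in for the periodic publish; meant to be called by the timer only. */
    public void refresh() {
        int n = publishes.incrementAndGet();
        latest.set("snapshot " + n);
    }

    /** Called on every scrape; returns the cached snapshot without publishing. */
    public String scrape() {
        return latest.get();
    }

    public int publishCount() {
        return publishes.get();
    }

    /** Starts the periodic publish; the caller is responsible for shutdown. */
    public ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(this::refresh, 0, periodSeconds, TimeUnit.SECONDS);
        return timer;
    }

    public static void main(String[] args) {
        CachedMetricsEndpoint endpoint = new CachedMetricsEndpoint();
        // One timer tick, simulated directly to keep the example deterministic.
        endpoint.refresh();
        // Three scrapes arrive before the next tick.
        for (int i = 0; i < 3; i++) {
            endpoint.scrape();
        }
        System.out.println("publishes=" + endpoint.publishCount()
            + " latest=" + endpoint.scrape());
    }
}
```

In this sketch the number of publishes depends only on the timer period, so a shorter scrape interval or several scrapers cannot add load to the queue or to the lock.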



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

