[ 
https://issues.apache.org/jira/browse/HDDS-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDDS-11339:
----------------------------------
    Labels: pull-request-available  (was: )

> Publishing hadoop metrics immediately in Prometheus sink fills up SinkQueue 
> quickly
> -----------------------------------------------------------------------------------
>
>                 Key: HDDS-11339
>                 URL: https://issues.apache.org/jira/browse/HDDS-11339
>             Project: Apache Ozone
>          Issue Type: Task
>          Components: Ozone Manager, Ozone Recon
>    Affects Versions: 1.4.0
>            Reporter: Devesh Kumar Singh
>            Assignee: Devesh Kumar Singh
>            Priority: Major
>              Labels: pull-request-available
>
> Two issues:
>  # PrometheusServlet is registered with BaseHttpServer when Prometheus 
> support is enabled. It is called on every scrape (every 15 seconds by 
> default) and publishes the Hadoop metrics immediately on each call. 
> On a very busy cluster with a large number of metrics to publish, the 
> SinkQueue fills up quickly, the sink cannot consume the metrics in time, 
> and they are dropped outright.
>  # Apart from the dropped metrics, publishing takes the object lock of the 
> MetricsSystemImpl class, and other threads keep waiting for that lock until 
> the metrics have actually been published. In a recent incident on a busy 
> cluster, about 190 threads were BLOCKED waiting to acquire the lock of the 
> MetricsSystemImpl class. This made the Recon role unresponsive, and after 
> some time the JVM could not allocate sufficient memory and crashed with an 
> OOM. The OOM is not specific to Recon: it can happen to any role that uses 
> the Prometheus service on a busy cluster.
>  
> Solution: We do not need to publish the metrics immediately by calling 
>  
> {code:java}
> DefaultMetricsSystem.instance().publishMetricsNow();
>  
> {code}
> because the Prometheus sink already has a mechanism to publish metrics every 
> 10 seconds by default, using a timer-driven callback. So we need to remove 
> the above call that publishes immediately.
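
The first issue can be illustrated with a toy model. This is only a sketch, not the Hadoop SinkQueue implementation: the class SinkQueueSketch, the method droppedSnapshots, and the capacity and drain-rate parameters are all hypothetical. Only the 10-second timer period and the 15-second scrape interval come from the description above. It counts how many metrics snapshots a bounded queue drops when every scrape also triggers a publish, compared with timer-driven publishing alone.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of a bounded sink queue; NOT the Hadoop implementation. */
final class SinkQueueSketch {
    // Intervals taken from the issue description.
    static final int TIMER_PERIOD_SECS = 10;   // periodic publish by the metrics timer
    static final int SCRAPE_PERIOD_SECS = 15;  // default Prometheus scrape interval

    /**
     * Counts snapshots dropped over durationSecs of simulated time.
     * capacity and drainPeriodSecs are hypothetical parameters of the model.
     */
    static int droppedSnapshots(int durationSecs, int capacity,
                                int drainPeriodSecs, boolean publishOnScrape) {
        Queue<Integer> queue = new ArrayDeque<>();
        int dropped = 0;
        for (int t = 1; t <= durationSecs; t++) {
            // The sink consumes at most one snapshot per drain period.
            if (t % drainPeriodSecs == 0) {
                queue.poll();
            }
            // The timer enqueues one snapshot per timer period.
            if (t % TIMER_PERIOD_SECS == 0) {
                dropped += offer(queue, capacity, t);
            }
            // Immediate publishing enqueues one more snapshot per scrape.
            if (publishOnScrape && t % SCRAPE_PERIOD_SECS == 0) {
                dropped += offer(queue, capacity, t);
            }
        }
        return dropped;
    }

    /** Returns 1 if the snapshot was dropped because the queue is full, else 0. */
    private static int offer(Queue<Integer> queue, int capacity, int snapshot) {
        if (queue.size() >= capacity) {
            return 1;
        }
        queue.add(snapshot);
        return 0;
    }
}
```

With a sink that drains one snapshot every 10 seconds, droppedSnapshots(600, 5, 10, false) drops nothing in this model, while droppedSnapshots(600, 5, 10, true) drops snapshots once the queue is full, because the extra per-scrape publishes arrive faster than the sink can consume them.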



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

