[
https://issues.apache.org/jira/browse/HDDS-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HDDS-11339:
----------------------------------
Labels: pull-request-available (was: )
> Publishing hadoop metrics immediately in Prometheus sink fills up SinkQueue
> quickly
> -----------------------------------------------------------------------------------
>
> Key: HDDS-11339
> URL: https://issues.apache.org/jira/browse/HDDS-11339
> Project: Apache Ozone
> Issue Type: Task
> Components: Ozone Manager, Ozone Recon
> Affects Versions: 1.4.0
> Reporter: Devesh Kumar Singh
> Assignee: Devesh Kumar Singh
> Priority: Major
> Labels: pull-request-available
>
> Two issues:
> # When Prometheus support is enabled, PrometheusServlet is registered with
> BaseHttpServer. Prometheus calls the servlet at its scrape interval (every
> 15 seconds by default), and each call publishes the Hadoop metrics
> immediately. On a very busy cluster with a large number of metrics to
> publish, the SinkQueue fills up quickly; the sink cannot consume the queued
> metrics fast enough, and they are dropped outright.
> # Apart from the dropped metrics, publishing immediately takes the object
> lock of the MetricsSystemImpl class, and other threads must wait for that
> lock until the metrics have actually been published. In a recent incident on
> a busy cluster, about 190 threads were BLOCKED waiting to acquire the
> MetricsSystemImpl lock. This made the Recon role unresponsive, and after
> some time the JVM could not allocate sufficient memory and crashed with an
> OOM. The OOM is not specific to Recon: it can happen with any role that uses
> the Prometheus service on a busy cluster.
>
> Solution: We do not need to publish the metrics immediately by calling
>
> {code:java}
> DefaultMetricsSystem.instance().publishMetricsNow();
> {code}
> because the Prometheus sink already has a mechanism to publish metrics every
> 10 seconds by default, using a timer-driven callback. So we need to remove
> the above call that publishes immediately.
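
The first issue can be illustrated with a toy model. This is only a sketch and not Hadoop's actual SinkQueue or MetricsSystemImpl code: the class name SinkQueueSketch, the queue capacity of 1, and the sink consuming one snapshot every 10 seconds are assumptions chosen for illustration. Only the 10-second timer period and the 15-second scrape interval come from the description above.

```java
import java.util.ArrayDeque;

/** Toy model of a bounded sink queue; hypothetical, not Hadoop's SinkQueue. */
public class SinkQueueSketch {
    private final ArrayDeque<Integer> buffer = new ArrayDeque<>();
    private final int capacity;
    private int dropped = 0;

    SinkQueueSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Adds a snapshot, or counts it as dropped when the queue is full. */
    boolean enqueue(int snapshot) {
        if (buffer.size() >= capacity) {
            dropped++;
            return false;
        }
        buffer.addLast(snapshot);
        return true;
    }

    /** The sink consumes one queued snapshot, if any. */
    void consumeOne() {
        buffer.pollFirst();
    }

    int dropped() {
        return dropped;
    }

    /**
     * Simulates the given number of seconds. The timer publishes every 10 s.
     * When publishOnScrape is true, every scrape (each 15 s) publishes as well.
     * The sink consumes one snapshot every consumeEverySeconds (assumed rate).
     */
    public static int simulateDrops(int seconds, int capacity,
                                    int consumeEverySeconds, boolean publishOnScrape) {
        SinkQueueSketch queue = new SinkQueueSketch(capacity);
        for (int t = 1; t <= seconds; t++) {
            if (t % 10 == 0) {
                queue.enqueue(t); // periodic timer publish
            }
            if (publishOnScrape && t % 15 == 0) {
                queue.enqueue(t); // extra publish triggered by the scrape
            }
            if (t % consumeEverySeconds == 0) {
                queue.consumeOne();
            }
        }
        return queue.dropped();
    }

    public static void main(String[] args) {
        System.out.println("dropped, publish on scrape: " + simulateDrops(600, 1, 10, true));
        System.out.println("dropped, timer only:        " + simulateDrops(600, 1, 10, false));
    }
}
```

With these assumed parameters the sink keeps up with the timer alone, while the extra publish on every scrape pushes more snapshots into the queue than the sink can drain, so some are dropped. The model does not cover the second issue (threads blocked on the MetricsSystemImpl lock).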