[PR] HDDS-11339. Publishing hadoop metrics immediately in Prometheus sink fills up SinkQueue quickly. [ozone]

via GitHub Mon, 19 Aug 2024 03:52:29 -0700


devmadhuu opened a new pull request, #7092:
URL: https://github.com/apache/ozone/pull/7092


   ## What changes were proposed in this pull request?
   HDDS-11339. Publishing hadoop metrics immediately in Prometheus sink fills 
up SinkQueue quickly.
   
   Two issues:
   
   `PrometheusServlet` is registered with `BaseHttpServer` when Prometheus 
support is enabled. It is called once per scrape interval (every 15 seconds 
by default), and on each call it publishes the Hadoop metrics immediately. 
On a very busy cluster with a large number of metrics to publish, the 
SinkQueue fills up quickly; the sink then cannot consume the incoming metrics 
and drops them outright.
   Apart from the dropped metrics, the other issue is that publishing takes 
the object lock of the `MetricsSystemImpl` class, and other threads keep 
waiting for that lock until the metrics have actually been published. In a 
recent incident on a busy cluster, about 190 threads were BLOCKED waiting to 
acquire the `MetricsSystemImpl` lock. This made the Recon role unresponsive, 
and after some time the JVM could not allocate sufficient memory and crashed 
with an OOM error. The problem is not specific to Recon: it can happen with 
any role that uses the Prometheus service on a busy cluster.
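
The queue behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `SinkQueueDropSketch`, `simulate`, and all of the numbers are hypothetical, and a plain `ArrayBlockingQueue` stands in for Hadoop's sink queue rather than reproducing it. It shows that when snapshots are offered faster than the consumer drains them, a bounded queue starts rejecting entries.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Toy model of a bounded sink queue; NOT Hadoop's actual SinkQueue. */
public class SinkQueueDropSketch {

    /**
     * Runs `ticks` rounds. In each round, `publishesPerTick` snapshots are
     * offered to a queue of size `capacity`, then the consumer drains up to
     * `drainedPerTick` of them. Returns how many snapshots were rejected
     * because the queue was full.
     */
    static int simulate(int capacity, int ticks, int publishesPerTick, int drainedPerTick) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        int dropped = 0;
        int next = 0;
        for (int t = 0; t < ticks; t++) {
            for (int p = 0; p < publishesPerTick; p++) {
                // offer() returns false instead of blocking when the queue is full
                if (!queue.offer(next++)) {
                    dropped++;
                }
            }
            for (int d = 0; d < drainedPerTick; d++) {
                if (queue.poll() == null) {
                    break; // nothing left to consume in this round
                }
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        // Hypothetical rates: one publish per round keeps up with the consumer,
        // extra "publish now" calls on top of it do not.
        System.out.println("timer only:          dropped=" + simulate(10, 100, 1, 1));
        System.out.println("timer + publish-now: dropped=" + simulate(10, 100, 3, 1));
    }
}
```

The second call models extra immediate publishes arriving on top of the periodic one; once the queue is full, every surplus snapshot is lost.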
    
   
   `Solution:` There is no need to publish the metrics immediately by calling 
   
    `DefaultMetricsSystem.instance().publishMetricsNow();`
    
   because the Prometheus sink already has a mechanism that publishes metrics 
every 10 seconds by default, through a timer-event callback. So the call above 
should be removed.
   
   Below is the onTimerEvent callback:
   `org.apache.hadoop.metrics2.impl.MetricsSystemImpl#onTimerEvent`
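
As a rough illustration of the idea, the sketch below separates publishing from scraping. It is a simplified model, not the Ozone or Hadoop code: `TimerOnlyPublishSketch`, its `onTimerEvent()` and `scrape()` methods, and the snapshot strings are hypothetical stand-ins, and the periodic timer is simulated by calling `onTimerEvent()` directly. The point is that repeated scrapes read what the last timer tick produced and never trigger a publish of their own.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

/** Toy model: only the timer callback publishes; scrapes read the cached result. */
public class TimerOnlyPublishSketch {
    private final AtomicInteger publishCount = new AtomicInteger();
    private final AtomicReference<String> cached = new AtomicReference<>("");

    /** Stand-in for the periodic timer callback: the only place that publishes. */
    void onTimerEvent() {
        int n = publishCount.incrementAndGet();
        cached.set("snapshot-" + n);
    }

    /** Stand-in for the scrape handler: returns the last published snapshot. */
    String scrape() {
        return cached.get();
    }

    int publishCount() {
        return publishCount.get();
    }

    public static void main(String[] args) {
        TimerOnlyPublishSketch metrics = new TimerOnlyPublishSketch();
        metrics.onTimerEvent();          // one simulated timer tick
        for (int i = 0; i < 5; i++) {
            metrics.scrape();            // scrapes do not publish
        }
        System.out.println("publishes=" + metrics.publishCount()
            + ", last=" + metrics.scrape());
    }
}
```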
   
   ## What is the link to the Apache JIRA
   https://issues.apache.org/jira/browse/HDDS-11339
   
   ## How was this patch tested?
   
   This patch was tested manually by running a Docker cluster with Prometheus 
support enabled and verifying that the metrics are published to the Prometheus 
instance.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

