[ 
https://issues.apache.org/jira/browse/AMBARI-24244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated AMBARI-24244:
---------------------------------
    Summary: Grafana HBase GC Time graph wrong / misleading - hiding large GC 
pauses ~ 2 dozen secs!  (was: Grafana HBase GC Time graph wrong / misleading - 
hiding large GC pauses)

> Grafana HBase GC Time graph wrong / misleading - hiding large GC pauses ~ 2 
> dozen secs!
> ---------------------------------------------------------------------------------------
>
>                 Key: AMBARI-24244
>                 URL: https://issues.apache.org/jira/browse/AMBARI-24244
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-metrics, metrics
>    Affects Versions: 2.5.2
>            Reporter: Hari Sekhon
>            Priority: Major
>
> Ambari's built-in Grafana "JVM GC Times" graph in the HBase - RegionServers 
> dashboard is very wrong and doesn't reflect the pause times I've grepped from 
> the HBase RegionServer logs for util.JvmPauseMonitor.
> I've inherited a very heavily loaded HBase + OpenTSDB cluster where 
> RegionServers are being lost due to GC pauses of around 30 seconds(!), causing 
> ZooKeeper + HMaster to declare them dead. The Grafana graphs show peaks of only 
> around 70ms because they average the GC time spent across all seconds, which 
> smooths out the peaks so that no problem is visible. If you are going to use 
> GCTimeMillis then I believe you need to divide it by GCCount to get the average 
> pause per collection, as sketched below.
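> A minimal worked sketch of that division (counter values are hypothetical; the 
> point is that dividing the GCTimeMillis delta by the GCCount delta recovers the 
> per-collection pause that per-second averaging hides):
> {code:java}
> // Two successive samples of the cumulative counters (values hypothetical).
> long gcTimePrev  = 120_000, gcTimeNow  = 150_000;  // GCTimeMillis
> long gcCountPrev = 4_000,   gcCountNow = 4_001;    // GCCount
>
> long countDelta = gcCountNow - gcCountPrev;
> long avgPauseMs = countDelta > 0
>         ? (gcTimeNow - gcTimePrev) / countDelta
>         : 0;
> // Prints 30000 ms - a ~30 second pause of the kind described above.
> System.out.println("avg pause per GC: " + avgPauseMs + " ms");
> {code}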
> Otherwise I believe this is actually the wrong metric to be watching; instead, 
> the following metric from HBase JMX should be monitored, taking the last value. 
> This does show the significant GC time spent:
> {code:java}
> java.lang:type=GarbageCollector,name=G1 Old Generation -> LastGcInfo -> 
> duration{code}
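> For illustration, a minimal in-process sketch of reading that MBean via the 
> com.sun.management extension (the class name here is hypothetical; the MBean 
> and attribute path are the ones quoted above):
> {code:java}
> import com.sun.management.GcInfo;
> import java.lang.management.ManagementFactory;
>
> public class LastGcPause {
>     public static void main(String[] args) {
>         // Walk every collector (e.g. "G1 Old Generation") and print the
>         // duration of its last collection - the pause that averaging hides.
>         for (java.lang.management.GarbageCollectorMXBean gc
>                 : ManagementFactory.getGarbageCollectorMXBeans()) {
>             if (gc instanceof com.sun.management.GarbageCollectorMXBean) {
>                 GcInfo last =
>                         ((com.sun.management.GarbageCollectorMXBean) gc).getLastGcInfo();
>                 if (last != null) {
>                     System.out.println(gc.getName() + ": " + last.getDuration() + " ms");
>                 }
>             }
>         }
>     }
> }
> {code}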
> Obviously make it match whichever garbage collector is in use via a regex, 
> whether G1 or CMS etc.:
> {code:java}
> java.lang:type=GarbageCollector,name=.*Old Gen.*  -> LastGcInfo -> 
> duration{code}
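> A sketch of doing the same over remote JMX with a name pattern plus the regex 
> above (the host and port are hypothetical; point them at the RegionServer's 
> JMX endpoint):
> {code:java}
> import java.util.regex.Pattern;
> import javax.management.MBeanServerConnection;
> import javax.management.ObjectName;
> import javax.management.openmbean.CompositeData;
> import javax.management.remote.JMXConnector;
> import javax.management.remote.JMXConnectorFactory;
> import javax.management.remote.JMXServiceURL;
>
> public class RemoteLastGcDuration {
>     public static void main(String[] args) throws Exception {
>         // Hypothetical RegionServer JMX endpoint.
>         JMXServiceURL url = new JMXServiceURL(
>                 "service:jmx:rmi:///jndi/rmi://regionserver-host:10102/jmxrmi");
>         Pattern oldGen = Pattern.compile(".*Old Gen.*");  // regex suggested above
>         try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
>             MBeanServerConnection conn = jmxc.getMBeanServerConnection();
>             for (ObjectName name : conn.queryNames(
>                     new ObjectName("java.lang:type=GarbageCollector,name=*"), null)) {
>                 if (!oldGen.matcher(name.getKeyProperty("name")).matches()) {
>                     continue;
>                 }
>                 // LastGcInfo is exposed as CompositeData; "duration" is in ms.
>                 CompositeData last = (CompositeData) conn.getAttribute(name, "LastGcInfo");
>                 if (last != null) {
>                     System.out.println(name.getKeyProperty("name") + ": "
>                             + last.get("duration") + " ms");
>                 }
>             }
>         }
>     }
> }
> {code}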
> Right now the GC Times graph is worse than useless: it's misleading, implying 
> there are no GC issues when this cluster actually has very large, very severe 
> GC pauses.
> This is a vanilla Ambari deployed Grafana with Ambari Metrics.



