Hari Sekhon created AMBARI-24306:
------------------------------------

             Summary: Ambari Metrics + Grafana - add LastGcInfo duration graphs 
for all server components for all GCs - G1GC Young + Old Gens, CMS and 
ParallelNew
                 Key: AMBARI-24306
                 URL: https://issues.apache.org/jira/browse/AMBARI-24306
             Project: Ambari
          Issue Type: New Feature
          Components: ambari-metrics, metrics
            Reporter: Hari Sekhon


Feature request: add Grafana graphs of the last value (not an average, please) 
of LastGcInfo duration for all the major garbage collectors:
 * G1GC Young Gen
 * G1GC Old Gen
 * CMS
 * ParNew

CMS and ParNew example taken from NameNode JMX metrics:
{code:java}
  }, {
    "name" : "java.lang:type=GarbageCollector,name=ConcurrentMarkSweep",
    "modelerType" : "sun.management.GarbageCollectorImpl",
    "LastGcInfo" : {
      "GcThreadCount" : 11,
      "duration" : 5206,
...
  }, {
    "name" : "java.lang:type=GarbageCollector,name=ParNew",
    "modelerType" : "sun.management.GarbageCollectorImpl",
    "LastGcInfo" : {
      "GcThreadCount" : 11,
      "duration" : 6,
{code}
G1GC Young and Old Gen example taken from RegionServer JMX metrics:
{code:java}
  }, {
    "name" : "java.lang:type=GarbageCollector,name=G1 Young Generation",
    "modelerType" : "sun.management.GarbageCollectorImpl",
    "LastGcInfo" : {
      "GcThreadCount" : 24,
      "duration" : 120,
{code}
{code:java}
  }, {
    "name" : "java.lang:type=GarbageCollector,name=G1 Old Generation",
    "modelerType" : "sun.management.GarbageCollectorImpl",
    "LastGcInfo" : {
      "GcThreadCount" : 24,
      "duration" : 19641,
{code}
Yes, this Old Gen GC is atrocious, which is why I'm here to tune it, but it 
helps if this is monitored properly in the first place, so that the problem is 
visible before there are random RegionServer deaths due to long GC pauses.
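
For illustration, a minimal Python sketch of pulling the last GC duration per collector out of a JMX JSON dump like the excerpts above. It assumes the dump is wrapped in a top-level "beans" list and that LastGcInfo can be null before the first collection. The function name last_gc_durations and the sample document are hypothetical; only the bean names and the durations come from the NameNode excerpt.

```python
import json

# Prefix shared by all garbage collector MBeans in the JMX excerpts above.
GC_PREFIX = "java.lang:type=GarbageCollector,name="


def last_gc_durations(jmx_doc):
    """Map collector name -> duration (ms) of its most recent collection."""
    durations = {}
    # Assumption: the dump wraps all MBeans in a top-level "beans" list.
    for bean in jmx_doc.get("beans", []):
        name = bean.get("name", "")
        if not name.startswith(GC_PREFIX):
            continue
        info = bean.get("LastGcInfo")
        if info is None:
            # Assumption: no collection has run yet for this collector.
            continue
        durations[name[len(GC_PREFIX):]] = info["duration"]
    return durations


# Hypothetical sample built from the NameNode values quoted above.
sample = json.loads("""
{"beans": [
  {"name": "java.lang:type=GarbageCollector,name=ConcurrentMarkSweep",
   "LastGcInfo": {"GcThreadCount": 11, "duration": 5206}},
  {"name": "java.lang:type=GarbageCollector,name=ParNew",
   "LastGcInfo": {"GcThreadCount": 11, "duration": 6}},
  {"name": "java.lang:type=Memory"}
]}
""")

print(last_gc_durations(sample))
```

Graphing these values as-is, rather than a rate derived from a cumulative counter, is what keeps a single long pause visible.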

Right now Ambari's Grafana has GCTimeMillis, which would make one think this is 
not a problem, as it only shows an averaged-out 40ms per sec of GC time. That 
isn't very helpful for spotting this long GC pause problem.
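
A rough sketch of why an averaged rate hides the problem. The 600 second window is a hypothetical value chosen for illustration, not how Ambari Metrics actually aggregates, and avg_gc_ms_per_sec is a made-up helper. Only the 19641 ms pause comes from the RegionServer excerpt above.

```python
def avg_gc_ms_per_sec(pause_ms, window_sec):
    """Total GC time in the window, spread evenly over every second of it."""
    return sum(pause_ms) / window_sec


pauses = [19641]   # the G1 Old Generation pause from the excerpt, in ms
window_sec = 600   # hypothetical averaging window

avg = avg_gc_ms_per_sec(pauses, window_sec)
print(round(avg, 1))   # averaged view: about 32.7 ms of GC per second
print(max(pauses))     # last-value view: 19641 ms in a single pause
```

Under these assumptions the averaged graph sits in the tens of milliseconds per second, while the pause that actually threatens the RegionServer is nearly 20 seconds long.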



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)