Misha Dmitriev created HADOOP-14960:
---------------------------------------

             Summary: Add GC time percentage monitor/alerter
                 Key: HADOOP-14960
                 URL: https://issues.apache.org/jira/browse/HADOOP-14960
             Project: Hadoop Common
          Issue Type: Improvement
            Reporter: Misha Dmitriev
            Assignee: Misha Dmitriev


Currently, class {{org.apache.hadoop.metrics2.source.JvmMetrics}} provides
several GC-related metrics. Unfortunately, these metrics are not as useful as
they could be, because they don't answer the first and most important question
about GC and JVM health: what percentage of time is my JVM paused in GC? This
percentage, calculated as the sum of the GC pauses over some period (say,
1 minute) divided by that period, is the most convenient measure of GC health
because:
- it is just one number, and it's clear that, say, 1..5% is good, but 80..90%
is really bad
- it allows for easy apples-to-apples comparison between runs, even between
different apps
- when this metric reaches some critical value like 70%, it almost always
indicates a "GC death spiral", from which the app can recover only by shedding
load, e.g. dropping some task(s).

The existing "total GC time", "total number of GCs" etc. metrics only give
numbers that can be used to roughly estimate this percentage. I therefore
suggest adding a new metric to this class, and possibly allowing users to
register handlers that will be invoked automatically when this metric reaches
a specified threshold.
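
To make the idea concrete, here is a minimal sketch of how such a percentage
could be computed over a sliding window, with a handler invoked at a threshold.
It is an illustration only, not the proposed patch: the class name
GcTimePercentageSketch, its constructor parameters and the record() method are
all hypothetical. It also assumes that the cumulative value returned by the
standard GarbageCollectorMXBean.getCollectionTime() is an acceptable
approximation of pause time, which may overstate pauses for collectors that do
part of their work concurrently.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.IntConsumer;

/** Hypothetical sketch; names and structure are assumptions, not Hadoop API. */
class GcTimePercentageSketch {
    private final long windowMillis;
    private final int thresholdPercent;
    private final IntConsumer alertHandler;
    // Samples of {timestampMillis, cumulativeGcMillis}, oldest first.
    private final Deque<long[]> samples = new ArrayDeque<>();

    GcTimePercentageSketch(long windowMillis, int thresholdPercent,
                           IntConsumer alertHandler) {
        this.windowMillis = windowMillis;
        this.thresholdPercent = thresholdPercent;
        this.alertHandler = alertHandler;
    }

    /** Cumulative GC time in ms, summed over all collectors the JVM reports. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc
                : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 means "undefined"
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /**
     * Records one sample and returns GC time as a percentage of the time
     * elapsed since the oldest sample still inside the window.
     */
    synchronized int record(long nowMillis, long cumulativeGcMillis) {
        samples.addLast(new long[] {nowMillis, cumulativeGcMillis});
        // Drop samples older than the window, always keeping one baseline.
        while (samples.size() > 1
                && nowMillis - samples.peekFirst()[0] > windowMillis) {
            samples.removeFirst();
        }
        long[] oldest = samples.peekFirst();
        long elapsed = nowMillis - oldest[0];
        if (elapsed <= 0) {
            return 0; // first sample: nothing to compare against yet
        }
        long gcMillis = cumulativeGcMillis - oldest[1];
        int percent = (int) Math.min(100, gcMillis * 100 / elapsed);
        if (percent >= thresholdPercent) {
            alertHandler.accept(percent);
        }
        return percent;
    }
}
```

A caller would invoke record(System.currentTimeMillis(),
GcTimePercentageSketch.totalGcMillis()) from a periodic task, for example once
per second. Passing the timestamp and the cumulative GC time in as arguments
keeps the arithmetic testable without a live JVM under GC pressure.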



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-dev-h...@hadoop.apache.org
