[
https://issues.apache.org/jira/browse/HADOOP-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16245274#comment-16245274
]
Xiao Chen commented on HADOOP-14960:
------------------------------------
Thanks Misha for the new changes. Incorporating the timestamp and GC time into
a class, and using a {{GcData}} class to handle update atomicity, looks pretty
good!
Please fix the checkstyle warnings. While you're at it, I have a few minor
comments :)
- Can we also {{setName}} on the {{GcTimeMonitor}} class, for better
debuggability?
- Let's add a precondition check on {{bufSize}} too, to make sure we don't
allocate crazy sizes here (say, 1M?)
- trivial Javadoc comments:
{{put a limit on a number of GCTimeMonitor instances}} s/a number/the number/g
{{@param observationWindowMs a period until now, over which the percentage}}
s/a period until now, over which/the interval over which/
- We usually use javadoc comment style on the ASF license class header. Could
you update {{GcTimeMonitor}}'s first line from {{/\*}} to {{/\*\*}}?
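To illustrate the first two suggestions, here is a minimal sketch, not the code from the patch. The class name {{NamedMonitorSketch}}, the way the buffer size is derived from the two intervals, and the exact cap of 1M entries are all assumptions made for the example; only the ideas of calling {{setName}} and bounding {{bufSize}} come from the comments above.

```java
// Hypothetical stand-in for a monitor thread; not the real GcTimeMonitor.
class NamedMonitorSketch extends Thread {

  // Assumed upper bound on buffer entries, following the "say, 1M?" suggestion.
  static final int MAX_BUFFER_SIZE = 1024 * 1024;

  private final long[] buffer;

  NamedMonitorSketch(long observationWindowMs, long sleepIntervalMs) {
    if (observationWindowMs <= 0 || sleepIntervalMs <= 0
        || sleepIntervalMs > observationWindowMs) {
      throw new IllegalArgumentException("Invalid intervals: window="
          + observationWindowMs + " ms, sleep=" + sleepIntervalMs + " ms");
    }

    // Assumed sizing rule: one slot per sample taken within the window.
    long bufSize = observationWindowMs / sleepIntervalMs;

    // Precondition check: refuse to allocate an unreasonably large buffer.
    if (bufSize > MAX_BUFFER_SIZE) {
      throw new IllegalArgumentException("Buffer size " + bufSize
          + " exceeds the limit of " + MAX_BUFFER_SIZE);
    }
    this.buffer = new long[(int) bufSize];

    // A descriptive thread name makes the monitor easy to find in thread dumps.
    setName("NamedMonitorSketch");
    setDaemon(true);
  }

  int bufferSize() {
    return buffer.length;
  }

  @Override
  public void run() {
    // Sampling loop omitted; this sketch only covers construction.
  }
}
```

With these assumptions, a 60-second window sampled every second yields a 60-entry buffer, while a request such as {{new NamedMonitorSketch(Long.MAX_VALUE, 1)}} fails fast instead of attempting a huge allocation.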
> Add GC time percentage monitor/alerter
> --------------------------------------
>
> Key: HADOOP-14960
> URL: https://issues.apache.org/jira/browse/HADOOP-14960
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Misha Dmitriev
> Assignee: Misha Dmitriev
> Attachments: HADOOP-14960.01.patch, HADOOP-14960.02.patch,
> HADOOP-14960.03.patch
>
>
> Currently class {{org.apache.hadoop.metrics2.source.JvmMetrics}} provides
> several metrics related to GC. Unfortunately, these metrics are not as
> useful as they could be, because they don't answer the first and most
> important question related to GC and JVM health: what percentage of time is
> my JVM paused in GC? This percentage, calculated as the sum of the GC pauses
> over some period (e.g. 1 minute) divided by that period, is the most
> convenient measure of GC health because:
> - it is just one number, and it's clear that, say, 1..5% is good, but 80..90%
> is really bad
> - it allows for easy apples-to-apples comparison between runs, even between
> different apps
> - when this metric reaches some critical value like 70%, it almost always
> indicates a "GC death spiral", from which the app can recover only if it
> drops some task(s) etc.
> The existing "total GC time", "total number of GCs" etc. metrics only give
> numbers that can be used to roughly estimate this percentage. Thus it is
> suggested to add a new metric to this class, and possibly allow users to
> register handlers that will be automatically invoked if this metric reaches
> the specified threshold.
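The percentage described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation in the attached patches: the class and method names ({{GcPercentageSketch}}, {{totalGcTimeMs}}, {{gcTimePercentage}}) are made up for the example, and it treats the cumulative collection time reported by the standard {{GarbageCollectorMXBean}} as the GC pause time, which may overstate pauses for concurrent collectors.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper; names are illustrative only.
class GcPercentageSketch {

  /** Sums the cumulative GC time (ms) reported by all collectors. */
  static long totalGcTimeMs() {
    long total = 0;
    for (GarbageCollectorMXBean bean
        : ManagementFactory.getGarbageCollectorMXBeans()) {
      long t = bean.getCollectionTime();
      // getCollectionTime() returns -1 when the value is undefined; skip it.
      if (t > 0) {
        total += t;
      }
    }
    return total;
  }

  /**
   * GC time accumulated during the window, divided by the window length,
   * expressed as a whole percentage clamped to 0..100.
   */
  static int gcTimePercentage(long gcTimeAtStartMs, long gcTimeAtEndMs,
      long windowMs) {
    if (windowMs <= 0) {
      throw new IllegalArgumentException(
          "windowMs must be positive: " + windowMs);
    }
    long gcInWindowMs = Math.max(0, gcTimeAtEndMs - gcTimeAtStartMs);
    return (int) Math.min(100, gcInWindowMs * 100 / windowMs);
  }

  public static void main(String[] args) throws InterruptedException {
    long windowMs = 1000;
    long before = totalGcTimeMs();
    Thread.sleep(windowMs);
    long after = totalGcTimeMs();
    System.out.println("GC time percentage over the last " + windowMs
        + " ms: " + gcTimePercentage(before, after, windowMs) + "%");
  }
}
```

For instance, 3 seconds of accumulated GC time within a 1-minute window gives 3000 * 100 / 60000 = 5%, which falls in the "good" range mentioned in the description. A handler for the threshold case could compare this value against a configured limit after each sample.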