[ 
https://issues.apache.org/jira/browse/HADOOP-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226186#comment-16226186
 ] 

Xiao Chen commented on HADOOP-14960:
------------------------------------

Thanks for contributing this pretty cool GC monitor Misha! Looks good to me in 
general.

Review comments, mostly nits:
- High level, have you thought about using 1 data structure to hold both the 
gcPause and timestamp ring buffers? I think a SortedMap would fit well in our 
use case. The current way (2 arrays) may be the most efficient, I'd prefer 
readability here since I assume this would be a very small part of memory 
consumption for the service running it.
- Suggest to do some input validation in {{GcTimeMonitor}} constructor: 
timestamps should not be negative, and also should not create too big a buffer 
size (this may go away if we change data structures :) ). 
{{maxGcTimePercentage}} also only makes sense for a (0, 100) value.
- This actually does not really discard any buffers
{code}
// Discard buffer entries that are older than curTime - observationWindowMs
     long startObsWindowTs = ts - observationWindowMs;
     while (tsBuf[startIdx] < startObsWindowTs && startIdx != endIdx) {
       startIdx = (startIdx + 1) % bufSize;
     }
{code}
- For the {{startIdx}} and {{endIdx}}, we should handle integer overflows of 
{{(index + 1) % bufSize}}. Maybe have a method like {{incrementInRing}}.
- We can {{Preconditions.checkNotNull}} on the input param in 
{{JvmMetrics#setGcTimeMonitor}}. (I know the current setPauseMonitor doesn't 
check, let's do better than that :) )
- Should we make the methods synchronized so we don't have to worry about {{the 
user observes inconsistent values}}?
- Typo in test:
{code}
// Run this for at least 1 sec for our monitor collects enough data
// to
// Run this for at least 1 sec for our monitor to collect enough data
{code}

> Add GC time percentage monitor/alerter
> --------------------------------------
>
>                 Key: HADOOP-14960
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14960
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: HADOOP-14960.01.patch
>
>
> Currently class {{org.apache.hadoop.metrics2.source.JvmMetrics}} provides 
> several metrics related to GC. Unfortunately, all these metrics are not as 
> useful as they could be, because they don't answer the first and most 
> important question related to GC and JVM health: what percentage of time my 
> JVM is paused in GC? This percentage, calculated as the sum of the GC pauses 
> over some period, like 1 minute, divided by that period - is the most 
> convenient measure of the GC health because:
> - it is just one number, and it's clear that, say, 1..5% is good, but 80..90% 
> is really bad
> - it allows for easy apple-to-apple comparison between runs, even between 
> different apps
> - when this metric reaches some critical value like 70%, it almost always 
> indicates a "GC death spiral", from which the app can recover only if it 
> drops some task(s) etc.
> The existing "total GC time", "total number of GCs" etc. metrics only give 
> numbers that can be used to rougly estimate this percentage. Thus it is 
> suggested to add a new metric to this class, and possibly allow users to 
> register handlers that will be automatically invoked if this metric reaches 
> the specified threshold.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to