[jira] [Commented] (HADOOP-14960) Add GC time percentage monitor/alerter

Misha Dmitriev (JIRA) Tue, 31 Oct 2017 17:27:35 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16229368#comment-16229368
 ]


Misha Dmitriev commented on HADOOP-14960:
-----------------------------------------

Thank you for the review, [~xiaochen]. See my responses inline below:

{quote}
High level, have you thought about using 1 data structure to hold both the 
gcPause and timestamp ring buffers? I think a SortedMap would fit well in our 
use case. The current way (2 arrays) may be the most efficient, I'd prefer 
readability here since I assume this would be a very small part of memory 
consumption for the service running it.
{quote}

I thought, but this makes my performance-oriented soul hurt. After optimizing 
so many applications storing numerical data in suboptimal format, I can hardly 
force myself to use an object-oriented data structure for something that I know 
in advance is just a fixed-size set of ints... And then, I think a ring buffer 
is not the most sophisticated/unreadable data structure. We could use something 
like an ArrayDeque here (which employs a ring buffer internally), but will 
still need to have similar code to move to the first entry that is within the 
observation window, etc.

{quote}
Suggest to do some input validation in GcTimeMonitor constructor: timestamps 
should not be negative, and also should not create too big a buffer size (this 
may go away if we change data structures ). maxGcTimePercentage also only makes 
sense for a (0, 100) value.
{quote}

Done.

{quote}
This actually does not really discard any buffers...
{quote}

Updated the comment to make this more clear.

{quote}
For the startIdx and endIdx, we should handle integer overflows of (index + 1) 
% bufSize. Maybe have a method like incrementInRing.
{quote}

I think there is no need for that - where can we get an overflow in this 
{{index = (index + 1) % bufSize;}} expression? It's designed to keep {{index}} 
within the {{0 .. bufSize - 1}} interval.

{quote}
We can Preconditions.checkNotNull on the input param in 
JvmMetrics#setGcTimeMonitor. (I know the current setPauseMonitor doesn't check, 
let's do better than that )
{quote}

Done.

{quote}
Should we make the methods synchronized so we don't have to worry about the 
user observes inconsistent values?
{quote}

This is tricky... We can, unfortunately, get inconsistent values from public 
{{getXXX()}} methods of {{GcTimeMonitor}} in any case, because all these 
methods are called separately and return one value each. So it can always 
happens that between a call to say {{getTotalGcTimeMillis()}} and 
{{getTotalGcCount()}} one of these would be updated. In practice I think this 
doesn't matter much, since this inconsistency will be just temporary and this 
information is, after all, expected to be used for monitoring, not for some 
precise calculations I guess.

If we really want to return all this data in a consistent form, we should, at a 
minimum, return all these numbers in a single object by a single method (and 
then they indeed should be written into that object in a synchronized way, and 
maybe more synchronization will be needed in other places). For now I think 
this is too much work for too unobvious a benefit. But please let me know if 
you think differently.

{quote}
Typo in test: 
{quote}

Fixed.

> Add GC time percentage monitor/alerter
> --------------------------------------
>
>                 Key: HADOOP-14960
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14960
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>            Priority: Major
>         Attachments: HADOOP-14960.01.patch, HADOOP-14960.02.patch
>
>
> Currently class {{org.apache.hadoop.metrics2.source.JvmMetrics}} provides 
> several metrics related to GC. Unfortunately, all these metrics are not as 
> useful as they could be, because they don't answer the first and most 
> important question related to GC and JVM health: what percentage of time my 
> JVM is paused in GC? This percentage, calculated as the sum of the GC pauses 
> over some period, like 1 minute, divided by that period - is the most 
> convenient measure of the GC health because:
> - it is just one number, and it's clear that, say, 1..5% is good, but 80..90% 
> is really bad
> - it allows for easy apple-to-apple comparison between runs, even between 
> different apps
> - when this metric reaches some critical value like 70%, it almost always 
> indicates a "GC death spiral", from which the app can recover only if it 
> drops some task(s) etc.
> The existing "total GC time", "total number of GCs" etc. metrics only give 
> numbers that can be used to rougly estimate this percentage. Thus it is 
> suggested to add a new metric to this class, and possibly allow users to 
> register handlers that will be automatically invoked if this metric reaches 
> the specified threshold.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (HADOOP-14960) Add GC time percentage monitor/alerter

Reply via email to