[
https://issues.apache.org/jira/browse/HADOOP-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16229368#comment-16229368
]
Misha Dmitriev commented on HADOOP-14960:
-----------------------------------------
Thank you for the review, [~xiaochen]. See my responses inline below:
{quote}
High level, have you thought about using 1 data structure to hold both the
gcPause and timestamp ring buffers? I think a SortedMap would fit well in our
use case. The current way (2 arrays) may be the most efficient, I'd prefer
readability here since I assume this would be a very small part of memory
consumption for the service running it.
{quote}
I thought, but this makes my performance-oriented soul hurt. After optimizing
so many applications storing numerical data in suboptimal format, I can hardly
force myself to use an object-oriented data structure for something that I know
in advance is just a fixed-size set of ints... And then, I think a ring buffer
is not the most sophisticated/unreadable data structure. We could use something
like an ArrayDeque here (which employs a ring buffer internally), but will
still need to have similar code to move to the first entry that is within the
observation window, etc.
{quote}
Suggest to do some input validation in GcTimeMonitor constructor: timestamps
should not be negative, and also should not create too big a buffer size (this
may go away if we change data structures ). maxGcTimePercentage also only makes
sense for a (0, 100) value.
{quote}
Done.
{quote}
This actually does not really discard any buffers...
{quote}
Updated the comment to make this more clear.
{quote}
For the startIdx and endIdx, we should handle integer overflows of (index + 1)
% bufSize. Maybe have a method like incrementInRing.
{quote}
I think there is no need for that - where can we get an overflow in this
{{index = (index + 1) % bufSize;}} expression? It's designed to keep {{index}}
within the {{0 .. bufSize - 1}} interval.
{quote}
We can Preconditions.checkNotNull on the input param in
JvmMetrics#setGcTimeMonitor. (I know the current setPauseMonitor doesn't check,
let's do better than that )
{quote}
Done.
{quote}
Should we make the methods synchronized so we don't have to worry about the
user observes inconsistent values?
{quote}
This is tricky... We can, unfortunately, get inconsistent values from public
{{getXXX()}} methods of {{GcTimeMonitor}} in any case, because all these
methods are called separately and return one value each. So it can always
happens that between a call to say {{getTotalGcTimeMillis()}} and
{{getTotalGcCount()}} one of these would be updated. In practice I think this
doesn't matter much, since this inconsistency will be just temporary and this
information is, after all, expected to be used for monitoring, not for some
precise calculations I guess.
If we really want to return all this data in a consistent form, we should, at a
minimum, return all these numbers in a single object by a single method (and
then they indeed should be written into that object in a synchronized way, and
maybe more synchronization will be needed in other places). For now I think
this is too much work for too unobvious a benefit. But please let me know if
you think differently.
{quote}
Typo in test:
{quote}
Fixed.
> Add GC time percentage monitor/alerter
> --------------------------------------
>
> Key: HADOOP-14960
> URL: https://issues.apache.org/jira/browse/HADOOP-14960
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Misha Dmitriev
> Assignee: Misha Dmitriev
> Priority: Major
> Attachments: HADOOP-14960.01.patch, HADOOP-14960.02.patch
>
>
> Currently class {{org.apache.hadoop.metrics2.source.JvmMetrics}} provides
> several metrics related to GC. Unfortunately, all these metrics are not as
> useful as they could be, because they don't answer the first and most
> important question related to GC and JVM health: what percentage of time my
> JVM is paused in GC? This percentage, calculated as the sum of the GC pauses
> over some period, like 1 minute, divided by that period - is the most
> convenient measure of the GC health because:
> - it is just one number, and it's clear that, say, 1..5% is good, but 80..90%
> is really bad
> - it allows for easy apple-to-apple comparison between runs, even between
> different apps
> - when this metric reaches some critical value like 70%, it almost always
> indicates a "GC death spiral", from which the app can recover only if it
> drops some task(s) etc.
> The existing "total GC time", "total number of GCs" etc. metrics only give
> numbers that can be used to rougly estimate this percentage. Thus it is
> suggested to add a new metric to this class, and possibly allow users to
> register handlers that will be automatically invoked if this metric reaches
> the specified threshold.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]