[
https://issues.apache.org/jira/browse/HADOOP-14960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226186#comment-16226186
]
Xiao Chen commented on HADOOP-14960:
------------------------------------
Thanks for contributing this pretty cool GC monitor Misha! Looks good to me in
general.
Review comments, mostly nits:
- High level, have you thought about using 1 data structure to hold both the
gcPause and timestamp ring buffers? I think a SortedMap would fit well in our
use case. The current way (2 arrays) may be the most efficient, I'd prefer
readability here since I assume this would be a very small part of memory
consumption for the service running it.
- Suggest to do some input validation in {{GcTimeMonitor}} constructor:
timestamps should not be negative, and also should not create too big a buffer
size (this may go away if we change data structures :) ).
{{maxGcTimePercentage}} also only makes sense for a (0, 100) value.
- This actually does not really discard any buffers
{code}
// Discard buffer entries that are older than curTime - observationWindowMs
long startObsWindowTs = ts - observationWindowMs;
while (tsBuf[startIdx] < startObsWindowTs && startIdx != endIdx) {
startIdx = (startIdx + 1) % bufSize;
}
{code}
- For the {{startIdx}} and {{endIdx}}, we should handle integer overflows of
{{(index + 1) % bufSize}}. Maybe have a method like {{incrementInRing}}.
- We can {{Preconditions.checkNotNull}} on the input param in
{{JvmMetrics#setGcTimeMonitor}}. (I know the current setPauseMonitor doesn't
check, let's do better than that :) )
- Should we make the methods synchronized so we don't have to worry about {{the
user observes inconsistent values}}?
- Typo in test:
{code}
// Run this for at least 1 sec for our monitor collects enough data
// to
// Run this for at least 1 sec for our monitor to collect enough data
{code}
> Add GC time percentage monitor/alerter
> --------------------------------------
>
> Key: HADOOP-14960
> URL: https://issues.apache.org/jira/browse/HADOOP-14960
> Project: Hadoop Common
> Issue Type: Improvement
> Reporter: Misha Dmitriev
> Assignee: Misha Dmitriev
> Attachments: HADOOP-14960.01.patch
>
>
> Currently class {{org.apache.hadoop.metrics2.source.JvmMetrics}} provides
> several metrics related to GC. Unfortunately, all these metrics are not as
> useful as they could be, because they don't answer the first and most
> important question related to GC and JVM health: what percentage of time my
> JVM is paused in GC? This percentage, calculated as the sum of the GC pauses
> over some period, like 1 minute, divided by that period - is the most
> convenient measure of the GC health because:
> - it is just one number, and it's clear that, say, 1..5% is good, but 80..90%
> is really bad
> - it allows for easy apple-to-apple comparison between runs, even between
> different apps
> - when this metric reaches some critical value like 70%, it almost always
> indicates a "GC death spiral", from which the app can recover only if it
> drops some task(s) etc.
> The existing "total GC time", "total number of GCs" etc. metrics only give
> numbers that can be used to rougly estimate this percentage. Thus it is
> suggested to add a new metric to this class, and possibly allow users to
> register handlers that will be automatically invoked if this metric reaches
> the specified threshold.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]