Erik Krogen created HADOOP-14989:
------------------------------------
Summary: Multiple metrics2 sinks (incl JMX) result in inconsistent
Mutable(Stat|Rate) values
Key: HADOOP-14989
URL: https://issues.apache.org/jira/browse/HADOOP-14989
Project: Hadoop Common
Issue Type: Bug
Components: metrics
Affects Versions: 2.6.5
Reporter: Erik Krogen
Priority: Critical
While doing some digging in the metrics2 system recently, we noticed that the
way {{MutableStat}} values are collected (and thus {{MutableRate}}, since it is
based on {{MutableStat}}) means that each configured sink (including JMX)
only receives a portion of the average information.
{{MutableStat}}, to compute its average value, maintains a total value since
last snapshot, as well as operation count since last snapshot. Upon
snapshotting, the average is calculated as (total / opCount) and placed into a
gauge metric, and both the total and the operation count are cleared. So the
average value represents the average since the last snapshot. If only a single
sink ever snapshots, this results in the expected behavior: the value is the
average over the reporting period. However, if multiple sinks are configured,
or if the JMX cache is refreshed, each of those triggers its own snapshot
operation. So, for
example, if you have a FileSink configured at a 60 second interval and your JMX
cache refreshes itself 1 second before the FileSink period fires, the values
emitted to your FileSink only represent averages _over the last one second_.
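The effect can be reproduced with a small stand-alone sketch. This is not Hadoop's {{MutableStat}}; the class and method names below are hypothetical, and the sketch only models the "compute average, then clear" behavior described above, assuming 59 fast operations followed by one slow one.

```java
// Simplified model of snapshot-and-reset averaging.
// NOT the real MutableStat; all names here are hypothetical.
public class SnapshotResetDemo {

    static final class ResettingAverage {
        private double total;       // sum of values since the last snapshot
        private long opCount;       // number of operations since the last snapshot
        private double lastAverage; // value served when nothing new was recorded

        void add(double value) {
            total += value;
            opCount++;
        }

        // Computes the average since the previous snapshot, then clears state.
        double snapshot() {
            if (opCount > 0) {
                lastAverage = total / opCount;
            }
            total = 0;
            opCount = 0;
            return lastAverage;
        }
    }

    // FileSink snapshots at t=60, but a JMX cache refresh snapshots at t=59.
    static double fileSinkAverageWithJmxRefresh() {
        ResettingAverage avg = new ResettingAverage();
        for (int second = 0; second < 59; second++) {
            avg.add(10.0);      // one 10 ms operation per second
        }
        avg.snapshot();         // JMX cache refresh clears total and count
        avg.add(100.0);         // one slow operation in the final second
        return avg.snapshot();  // FileSink only sees the last second
    }

    // Same workload, but FileSink is the only snapshotter.
    static double fileSinkAverageAlone() {
        ResettingAverage avg = new ResettingAverage();
        for (int second = 0; second < 59; second++) {
            avg.add(10.0);
        }
        avg.add(100.0);
        return avg.snapshot();
    }

    public static void main(String[] args) {
        System.out.println("with JMX refresh: " + fileSinkAverageWithJmxRefresh());
        System.out.println("FileSink alone:   " + fileSinkAverageAlone());
    }
}
```

In this model the FileSink reports 100.0 when the JMX refresh intervenes and 11.5 when it is the only snapshotter, even though the workload is identical.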
A few ways to solve this issue:
* From an operator perspective, ensure only one sink is configured. This is not
realistic given that the JMX cache exhibits the same behavior.
* Make {{MutableRate}} manage its own average refresh, similar to
{{MutableQuantiles}}, which has a refresh thread and saves a snapshot of the
last quantile values that it will serve up until the next refresh. Given how
many {{MutableRate}} metrics there are, a thread per metric is not really
feasible, but it could be done on, e.g., a per-source basis. This has a
downside: if multiple sinks are configured with different periods, what is the
right refresh period for the {{MutableRate}}?
* Make {{MutableRate}} emit two counters, one for total and one for operation
count, rather than an average gauge and an operation count counter. The average
could then be calculated downstream from this information. This is cumbersome
for operators and not backwards compatible. To improve on both of those
downsides, we could have it keep the current behavior but _additionally_ emit
the total as a counter. The snapshotted average is probably sufficient in the
common case (we've been using it for years), and when more guaranteed accuracy
is required, the average could be derived from the total and operation count.
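As a rough illustration of the last option, a downstream consumer could derive the average from two cumulative counters. The sketch below assumes the total and operation count are emitted as monotonically increasing counters that are never reset by a snapshot; the {{Sample}} type and {{averageBetween}} helper are hypothetical and not part of metrics2.

```java
// Downstream derivation of an interval average from cumulative counters.
// Hypothetical names; assumes counters are never cleared on snapshot.
public class CounterDeltaAverage {

    // One observation by a sink: cumulative total and cumulative op count.
    static final class Sample {
        final double total;
        final long opCount;

        Sample(double total, long opCount) {
            this.total = total;
            this.opCount = opCount;
        }
    }

    // Average over the interval between two observations made by the same sink.
    // Returns NaN when no operations occurred in the interval.
    static double averageBetween(Sample previous, Sample current) {
        long ops = current.opCount - previous.opCount;
        if (ops == 0) {
            return Double.NaN;
        }
        return (current.total - previous.total) / ops;
    }

    public static void main(String[] args) {
        Sample atStart = new Sample(0.0, 0);
        Sample atSixtySeconds = new Sample(690.0, 60);
        System.out.println(averageBetween(atStart, atSixtySeconds));
    }
}
```

Because each sink only compares its own consecutive readings, a snapshot taken by another sink or by the JMX cache in between does not change the result.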
Open to suggestions & input here.