[
https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234894#comment-16234894
]
Erik Krogen commented on HADOOP-14989:
--------------------------------------
Thank you for the comments [~eyang]! You actually made me realize I had a bit
of a misunderstanding after digging into the code further. Let me try again:
* The problem I described is definitely an issue if you specify multiple
refresh rates. I agree there's not a great way around this issue but I think we
should, at minimum, put something in the documentation indicating that it is
not a good idea. Right now the behavior I describe when dealing with
MutableRate values is not documented and would come as a surprise to an
operator.
* Specifying only a single refresh rate does not solve the JMX issue. The
single-point collection of metrics for all sinks occurs in
{{MetricsSystemImpl}}, specifically {{sampleMetrics()}}, which then passes off
the single {{MetricsBuffer}} to all sinks. This is great. However, JMX avoids
the {{MetricsSystemImpl}} code altogether, instead directly calling
{{getMetrics()}} on each {{MetricsSourceAdapter}}. Thus JMX cache refills can
destroy metrics values even if you correctly configure only one period. I have
attached a patch, [^HADOOP-14989.test.patch], which demonstrates this issue -
it's hacky but it should get the point across.
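To illustrate the effect outside of Hadoop, here is a minimal sketch of a reset-on-snapshot average. {{ToyStat}} and {{sinkAverage}} are hypothetical names made up for this example; this is not the actual {{MutableStat}} code and not the attached test patch, only a toy model of the behavior described above.

```java
/** Toy stand-in for a reset-on-snapshot average; NOT Hadoop's MutableStat. */
public class SnapshotResetDemo {
    /** Accumulates a total and an op count since the last snapshot. */
    static final class ToyStat {
        private double total;
        private long opCount;
        private double lastAvg; // served again when no new samples arrived

        void add(double value) {
            total += value;
            opCount++;
        }

        /** Returns the average since the previous snapshot, then clears state. */
        double snapshot() {
            if (opCount > 0) {
                lastAvg = total / opCount;
                total = 0;
                opCount = 0;
            }
            return lastAvg;
        }
    }

    /** One sample per second for 60 s; a JMX refill is modeled as an extra snapshot. */
    static double sinkAverage(boolean jmxRefillAtSecond59) {
        ToyStat stat = new ToyStat();
        for (int second = 1; second <= 60; second++) {
            // Latency is 10 ms for the first 59 seconds and 100 ms in the last one.
            stat.add(second <= 59 ? 10.0 : 100.0);
            if (jmxRefillAtSecond59 && second == 59) {
                stat.snapshot(); // the JMX read consumes the first 59 samples
            }
        }
        return stat.snapshot(); // what the single 60 s sink receives
    }

    public static void main(String[] args) {
        System.out.println("sink only:       " + sinkAverage(false));
        System.out.println("sink + JMX read: " + sinkAverage(true));
    }
}
```

With only the sink snapshotting, the reported average covers all 60 samples (11.5). With one extra JMX-style read at second 59, the sink reports only the final sample (100.0), even though a single period is configured.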
It seems to me the best way to fix this is to save the output values each time
{{getMetrics()}} is called and use those for the cache. We can either:
* Call {{updateJmxCache()}} at the end of {{getMetrics()}} with the computed
values
* Store the return value of {{getMetrics()}} and use it as the input for
{{updateJmxCache()}} the next time it is called, assuming that value is fresh
enough.
The second is considerably more complex, but it avoids some of the potential
performance penalty of the {{updateAttrCache()}} and {{updateInfoCache()}}
calls, which do create a bunch of objects. I am not sure the savings would be
enough to be worth the extra complexity.
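As a rough sketch of the first option, assuming a toy adapter rather than the real {{MetricsSourceAdapter}}: the sink path snapshots once and hands the computed values to the JMX cache, so a JMX read never performs its own snapshot. The class, the {{getAttribute}} method shape, and the metric names are hypothetical, and cache expiry (what JMX should do when no sink has snapshotted recently) is deliberately left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy adapter; a hypothetical model, not Hadoop's MetricsSourceAdapter. */
public class CachingAdapterSketch {
    private double total;
    private long opCount;
    private int snapshotCount; // how many times the interval state was reset
    private Map<String, Double> jmxCache = new HashMap<>();

    void add(double value) {
        total += value;
        opCount++;
    }

    /** Sink path: snapshot and reset, then reuse the result for the JMX cache. */
    Map<String, Double> getMetrics() {
        Map<String, Double> values = new HashMap<>();
        values.put("AvgTime", opCount == 0 ? 0.0 : total / opCount);
        values.put("NumOps", (double) opCount);
        total = 0;
        opCount = 0;
        snapshotCount++;
        // Option 1: refresh the JMX cache with the values just computed.
        jmxCache = new HashMap<>(values);
        return values;
    }

    /** JMX path: serves cached values only and never triggers a snapshot. */
    Double getAttribute(String name) {
        return jmxCache.get(name);
    }

    int snapshotCount() {
        return snapshotCount;
    }

    public static void main(String[] args) {
        CachingAdapterSketch adapter = new CachingAdapterSketch();
        adapter.add(10.0);
        adapter.add(20.0);
        adapter.getMetrics();                      // the one sink snapshot
        adapter.getAttribute("AvgTime");           // JMX reads do not reset anything
        adapter.getAttribute("AvgTime");
        System.out.println("snapshots taken: " + adapter.snapshotCount());
    }
}
```

In this model any number of JMX reads leaves the interval totals untouched, which is the property the fix is after; the second option would differ only in when the cache is populated.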
While digging / testing I also noticed another bug, which occurs if you have
multiple sink periods set; see HADOOP-15008.
> Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate)
> values
> -----------------------------------------------------------------------------------
>
> Key: HADOOP-14989
> URL: https://issues.apache.org/jira/browse/HADOOP-14989
> Project: Hadoop Common
> Issue Type: Bug
> Components: metrics
> Affects Versions: 2.6.5
> Reporter: Erik Krogen
> Priority: Critical
> Attachments: HADOOP-14989.test.patch
>
>
> While doing some digging in the metrics2 system recently, we noticed that the
> way {{MutableStat}} values are collected (and thus {{MutableRate}}, since it
> is based off of {{MutableStat}}) means that each sink configured (including
> JMX) only receives a portion of the average information.
> {{MutableStat}}, to compute its average value, maintains a total value since
> last snapshot, as well as operation count since last snapshot. Upon
> snapshotting, the average is calculated as (total / opCount) and placed into
> a gauge metric, and total / operation count are cleared. So the average value
> represents the average since the last snapshot. If only a single sink ever
> snapshots, this would result in the expected behavior that the value is the
> average over the reporting period. However, if multiple sinks are configured,
> or if the JMX cache is refreshed, this is another snapshot operation. So, for
> example, if you have a FileSink configured at a 60 second interval and your
> JMX cache refreshes itself 1 second before the FileSink period fires, the
> values emitted to your FileSink only represent averages _over the last one
> second_.
> A few ways to solve this issue:
> * From an operator perspective, ensure only one sink is configured. This is
> not realistic given that the JMX cache exhibits the same behavior.
> * Make {{MutableRate}} manage its own average refresh, similar to
> {{MutableQuantiles}}, which has a refresh thread and saves a snapshot of the
> last quantile values that it will serve up until the next refresh. Given how
> many {{MutableRate}} metrics there are, a thread per metric is not really
> feasible, but could be done on e.g. a per-source basis. This has some
> downsides: if multiple sinks are configured with different periods, what is
> the right refresh period for the {{MutableRate}}?
> * Make {{MutableRate}} emit two counters, one for total and one for operation
> count, rather than an average gauge and an operation count counter. The
> average could then be calculated downstream from this information. This is
> cumbersome for operators and not backwards compatible. To improve on both of
> those downsides, we could have it keep the current behavior but
> _additionally_ emit the total as a counter. The snapshotted average is
> probably sufficient in the common case (we've been using it for years), and
> when more guaranteed accuracy is required, the average could be derived from
> the total and operation count.
> Open to suggestions & input here.
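For the last option in the quoted description (additionally emitting the total as a counter), the downstream calculation could look roughly like the sketch below. It assumes both values are monotonically increasing counters; {{Reading}}, {{totalTime}}, {{numOps}}, and {{averageBetween}} are hypothetical names used only for illustration.

```java
/** Derives a windowed average from two readings of cumulative counters. */
public class DerivedAverage {
    /** One reading of two monotonically increasing counters (assumed). */
    static final class Reading {
        final double totalTime;
        final long numOps;

        Reading(double totalTime, long numOps) {
            this.totalTime = totalTime;
            this.numOps = numOps;
        }
    }

    /** Average over the window between two readings; NaN if no ops occurred. */
    static double averageBetween(Reading earlier, Reading later) {
        long ops = later.numOps - earlier.numOps;
        if (ops <= 0) {
            return Double.NaN;
        }
        return (later.totalTime - earlier.totalTime) / ops;
    }

    public static void main(String[] args) {
        Reading start = new Reading(0.0, 0);
        Reading end = new Reading(690.0, 60);
        System.out.println("average over window: " + averageBetween(start, end));
    }
}
```

Because reading a counter does not reset it, each consumer can pick its own window and is unaffected by how often any other sink or the JMX cache reads the same values.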