[jira] [Comment Edited] (HADOOP-14989) Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) values

2017-11-02 Thread Erik Krogen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235883#comment-16235883
 ] 

Erik Krogen edited comment on HADOOP-14989 at 11/2/17 3:14 PM:
---

Assuming nothing else was submitting block reports, {{val}} with the 
current code would be 10, but it should be 5005 (it is an average, so {{= 
10010/2}}). Since the test takes metrics from a minicluster, there are also 
some real block reports that skew things; that's why I used a big value and a 
comparison rather than an equality check. Like I said, hacky. But the test will 
definitely pass if you omit the JMX call and definitely fail if you include 
it. I'll try to put together a real unit test for this.
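
To make the arithmetic concrete, here is a minimal sketch of the effect. It assumes two samples, 10000 and 10 (chosen only because they sum to the 10010 above), and uses a hypothetical {{ResettingStat}} class as a stand-in for the clear-on-snapshot behavior; it is not the actual {{MutableStat}} code and it does not touch a minicluster.

```java
// Toy model of the clear-on-snapshot behavior discussed in this issue.
// ResettingStat is a hypothetical stand-in, NOT Hadoop's MutableStat.
class SnapshotAverageDemo {
    static final class ResettingStat {
        private long total;     // sum of samples since the last snapshot
        private long count;     // number of samples since the last snapshot
        private double lastAvg; // average published by the last snapshot

        void add(long value) {
            total += value;
            count++;
        }

        // Publishes total/count as the average, then clears both.
        double snapshot() {
            if (count > 0) {
                lastAvg = (double) total / count;
                total = 0;
                count = 0;
            }
            return lastAvg;
        }
    }

    // Returns the average handed to a sink. If jmxReadInBetween is true,
    // an extra snapshot (like a JMX cache refresh) happens between samples.
    static double sinkAverage(boolean jmxReadInBetween) {
        ResettingStat stat = new ResettingStat();
        stat.add(10_000);          // assumed first sample
        if (jmxReadInBetween) {
            stat.snapshot();       // consumes and clears the first sample
        }
        stat.add(10);              // assumed second sample
        return stat.snapshot();    // what the sink receives
    }

    public static void main(String[] args) {
        System.out.println(sinkAverage(false)); // 5005.0
        System.out.println(sinkAverage(true));  // 10.0
    }
}
```

Without the intermediate read the sink sees the average of both samples; with it, the sink only sees the sample recorded after the extra snapshot.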

I am not sure what you mean about JMX mbean calling reset internally. Are you 
talking here about the metrics2 level reset 
({{MetricsSourceAdapter#updateJmxCache()}}) or something at a JVM level? I 
explained how the cache reset is managed at the metrics2 level; let me know if 
there's something about my explanation that was not clear.


was (Author: xkrogen):
Assuming nothing else was submitting block reports, then {{val}} with the 
current code would be 10, but it should be 5005 (it is an average so {{= 
10010/2}}). Since it is taking metrics from a minicluster there are also 
some real block reports that skew things; that's why I used a big value and a 
comparison rather than equal. Like I said, hacky. But the test will 
definitively pass if you omit the JMX call and definitely fail if you include 
it. I'll try to put together a real unit test for this.

I am not sure what you mean about JMX mbean calling reset internally. Are you 
talking here about the metrics2 level reset 
({{MetricsSourceAdapter#updateJmxCache()}} or something at a JVM level? I 
explained how the cache reset is managed at the metrics2 level; let me know if 
there's something about my explanation that was not clear.

> Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) 
> values
> ---
>
> Key: HADOOP-14989
> URL: https://issues.apache.org/jira/browse/HADOOP-14989
> Project: Hadoop Common
> Issue Type: Bug
> Components: metrics
> Affects Versions: 2.6.5
> Reporter: Erik Krogen
> Assignee: Erik Krogen
> Priority: Critical
> Attachments: HADOOP-14989.test.patch
>
>
> While doing some digging in the metrics2 system recently, we noticed that the 
> way {{MutableStat}} values are collected (and thus {{MutableRate}}, since it 
> is based off of {{MutableStat}}) means that each sink configured (including 
> JMX) only receives a portion of the average information.
> {{MutableStat}}, to compute its average value, maintains a total value since 
> last snapshot, as well as operation count since last snapshot. Upon 
> snapshotting, the average is calculated as (total / opCount) and placed into 
> a gauge metric, and the total and operation count are cleared. So the average value 
> represents the average since the last snapshot. If only a single sink ever 
> snapshots, this would result in the expected behavior that the value is the 
> average over the reporting period. However, if multiple sinks are configured, 
> or if the JMX cache is refreshed, this is another snapshot operation. So, for 
> example, if you have a FileSink configured at a 60 second interval and your 
> JMX cache refreshes itself 1 second before the FileSink period fires, the 
> values emitted to your FileSink only represent averages _over the last one 
> second_.
> A few ways to solve this issue:
> * From an operator perspective, ensure only one sink is configured. This is 
> not realistic given that the JMX cache exhibits the same behavior.
> * Make {{MutableRate}} manage its own average refresh, similar to 
> {{MutableQuantiles}}, which has a refresh thread and saves a snapshot of the 
> last quantile values that it will serve up until the next refresh. Given how 
> many {{MutableRate}} metrics there are, a thread per metric is not really 
> feasible, but could be done on e.g. a per-source basis. This has some 
> downsides: if multiple sinks are configured with different periods, what is 
> the right refresh period for the {{MutableRate}}? 
> * Make {{MutableRate}} emit two counters, one for total and one for operation 
> count, rather than an average gauge and an operation count counter. The 
> average could then be calculated downstream from this information. This is 
> cumbersome for operators and not backwards compatible. To improve on both of 
> those downsides, we could have it keep the current behavior but 
> _additionally_ emit the total as a counter. The snapshotted average is 
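> probably sufficient in the common case (we've been using it for years), and 
> when more guaranteed accuracy is required, the average could be derived from 
> the total and operation count.
> Open to suggestions & input here.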

[jira] [Comment Edited] (HADOOP-14989) Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) values

2017-11-01 Thread Erik Krogen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234898#comment-16234898
 ] 

Erik Krogen edited comment on HADOOP-14989 at 11/1/17 10:50 PM:


One more comment:
{quote}
Total is a high watermark and it will eventually overflow.
{quote}
Sure it is a high watermark, but so are all of the {{MutableCounter}} metrics 
in Hadoop. These all rise indefinitely; I fail to see how this situation is any 
different.


was (Author: xkrogen):
One more comment:
> Total is a high watermark and it will eventually overflow.
Sure it is a high watermark, but so are all of the {{MutableCounter}} metrics 
in Hadoop. These all rise indefinitely; I fail to see how this situation is any 
different.

> Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) 
> values
> ---
>
> Key: HADOOP-14989
> URL: https://issues.apache.org/jira/browse/HADOOP-14989
> Project: Hadoop Common
> Issue Type: Bug
> Components: metrics
> Affects Versions: 2.6.5
> Reporter: Erik Krogen
> Assignee: Erik Krogen
> Priority: Critical
> Attachments: HADOOP-14989.test.patch
>
>
> While doing some digging in the metrics2 system recently, we noticed that the 
> way {{MutableStat}} values are collected (and thus {{MutableRate}}, since it 
> is based off of {{MutableStat}}) means that each sink configured (including 
> JMX) only receives a portion of the average information.
> {{MutableStat}}, to compute its average value, maintains a total value since 
> last snapshot, as well as operation count since last snapshot. Upon 
> snapshotting, the average is calculated as (total / opCount) and placed into 
> a gauge metric, and the total and operation count are cleared. So the average value 
> represents the average since the last snapshot. If only a single sink ever 
> snapshots, this would result in the expected behavior that the value is the 
> average over the reporting period. However, if multiple sinks are configured, 
> or if the JMX cache is refreshed, this is another snapshot operation. So, for 
> example, if you have a FileSink configured at a 60 second interval and your 
> JMX cache refreshes itself 1 second before the FileSink period fires, the 
> values emitted to your FileSink only represent averages _over the last one 
> second_.
> A few ways to solve this issue:
> * From an operator perspective, ensure only one sink is configured. This is 
> not realistic given that the JMX cache exhibits the same behavior.
> * Make {{MutableRate}} manage its own average refresh, similar to 
> {{MutableQuantiles}}, which has a refresh thread and saves a snapshot of the 
> last quantile values that it will serve up until the next refresh. Given how 
> many {{MutableRate}} metrics there are, a thread per metric is not really 
> feasible, but could be done on e.g. a per-source basis. This has some 
> downsides: if multiple sinks are configured with different periods, what is 
> the right refresh period for the {{MutableRate}}? 
> * Make {{MutableRate}} emit two counters, one for total and one for operation 
> count, rather than an average gauge and an operation count counter. The 
> average could then be calculated downstream from this information. This is 
> cumbersome for operators and not backwards compatible. To improve on both of 
> those downsides, we could have it keep the current behavior but 
> _additionally_ emit the total as a counter. The snapshotted average is 
> probably sufficient in the common case (we've been using it for years), and 
> when more guaranteed accuracy is required, the average could be derived from 
> the total and operation count.
> Open to suggestions & input here.
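
As a rough illustration of the third option, the sketch below shows how a consumer could derive an average from two cumulative counters by differencing its own consecutive readings. The {{Reading}} type and the {{windowAverage}} helper are hypothetical names invented for this example and are not part of metrics2; the sample values reuse the 10010/2 figures from the test discussion.

```java
// Sketch of deriving an average downstream from cumulative counters.
// Reading and windowAverage are hypothetical names, not metrics2 APIs.
class DownstreamAverageDemo {
    // One consumer's reading of two counters that only ever increase.
    static final class Reading {
        final long total;   // cumulative sum of all recorded values
        final long numOps;  // cumulative number of recorded operations

        Reading(long total, long numOps) {
            this.total = total;
            this.numOps = numOps;
        }
    }

    // Average over the window between two readings taken by the same consumer.
    static double windowAverage(Reading previous, Reading current) {
        long ops = current.numOps - previous.numOps;
        if (ops == 0) {
            return 0.0; // no operations in this window
        }
        return (double) (current.total - previous.total) / ops;
    }

    public static void main(String[] args) {
        Reading start = new Reading(0, 0);
        Reading end = new Reading(10_010, 2);
        System.out.println(windowAverage(start, end)); // 5005.0
    }
}
```

Because the counters are never cleared by a snapshot, a read by another sink or by JMX between {{start}} and {{end}} would not change the result each consumer computes for its own window.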


