[jira] [Commented] (HADOOP-14989) Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) values

2017-11-02 Thread Erik Krogen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235883#comment-16235883
 ] 

Erik Krogen commented on HADOOP-14989:
--

Assuming nothing else was submitting block reports, {{val}} with the 
current code would be 10, but it should be 5005 (it is an average, so {{= 
10010/2}}). Since the test takes metrics from a minicluster, there are also 
some real block reports that skew things; that's why I used a big value and a 
comparison rather than an equality check. Like I said, hacky. But the test will 
definitely pass if you omit the JMX call and definitely fail if you include 
it. I'll try to put together a real unit test for this.
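
The arithmetic above can be reproduced with a minimal sketch. This is not the 
attached test patch: {{SimpleStat}} is a hypothetical stand-in for the 
snapshot-and-reset behavior, and the sample values 10000 and 10 are assumed 
because they sum to the 10010 mentioned above.

```java
// Hypothetical, simplified model of the snapshot-and-reset behavior;
// this is not the real org.apache.hadoop.metrics2 code.
final class SnapshotResetDemo {

    /** Stand-in for a MutableStat-like metric. */
    static final class SimpleStat {
        private double total;   // sum of samples since the last snapshot
        private long count;     // number of samples since the last snapshot
        private double lastAvg; // average published by the latest snapshot

        void add(double sample) {
            total += sample;
            count++;
        }

        /** Publishes the interval average, then clears total and count. */
        double snapshot() {
            if (count > 0) {
                lastAvg = total / count;
                total = 0;
                count = 0;
            }
            return lastAvg;
        }
    }

    /** Records 10000 and then 10 (assumed values); returns what the final snapshot reports. */
    static double run(boolean extraSnapshotInBetween) {
        SimpleStat stat = new SimpleStat();
        stat.add(10000);
        if (extraSnapshotInBetween) {
            stat.snapshot(); // models a JMX cache refresh consuming the first sample
        }
        stat.add(10);
        return stat.snapshot(); // models the value the test reads back
    }

    public static void main(String[] args) {
        System.out.println("no extra snapshot:   " + run(false)); // 5005.0
        System.out.println("with extra snapshot: " + run(true));  // 10.0
    }
}
```

In this model the extra snapshot does not lose the first sample outright; it 
moves it into an interval that the final reader never sees.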

I am not sure what you mean about the JMX mbean calling reset internally. Are you 
talking about the metrics2-level reset 
({{MetricsSourceAdapter#updateJmxCache()}}) or something at the JVM level? I 
explained how the cache reset is managed at the metrics2 level; let me know if 
anything in my explanation was unclear.

> Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) 
> values
> ---
>
> Key: HADOOP-14989
> URL: https://issues.apache.org/jira/browse/HADOOP-14989
> Project: Hadoop Common
> Issue Type: Bug
> Components: metrics
> Affects Versions: 2.6.5
> Reporter: Erik Krogen
> Assignee: Erik Krogen
> Priority: Critical
> Attachments: HADOOP-14989.test.patch
>
>
> While doing some digging in the metrics2 system recently, we noticed that the 
> way {{MutableStat}} values are collected (and thus {{MutableRate}}, since it 
> is based on {{MutableStat}}) means that each sink configured (including 
> JMX) only receives a portion of the average information.
> {{MutableStat}}, to compute its average value, maintains a total value since 
> last snapshot, as well as operation count since last snapshot. Upon 
> snapshotting, the average is calculated as (total / opCount) and placed into 
> a gauge metric, and both the total and the operation count are cleared. So 
> the average value represents the average since the last snapshot. If only a 
> single sink ever 
> snapshots, this would result in the expected behavior that the value is the 
> average over the reporting period. However, if multiple sinks are configured, 
> or if the JMX cache is refreshed, this is another snapshot operation. So, for 
> example, if you have a FileSink configured at a 60 second interval and your 
> JMX cache refreshes itself 1 second before the FileSink period fires, the 
> values emitted to your FileSink only represent averages _over the last one 
> second_.
> A few ways to solve this issue:
> * From an operator perspective, ensure only one sink is configured. This is 
> not realistic given that the JMX cache exhibits the same behavior.
> * Make {{MutableRate}} manage its own average refresh, similar to 
> {{MutableQuantiles}}, which has a refresh thread and saves a snapshot of the 
> last quantile values that it will serve up until the next refresh. Given how 
> many {{MutableRate}} metrics there are, a thread per metric is not really 
> feasible, but could be done on e.g. a per-source basis. This has some 
> downsides: if multiple sinks are configured with different periods, what is 
> the right refresh period for the {{MutableRate}}? 
> * Make {{MutableRate}} emit two counters, one for total and one for operation 
> count, rather than an average gauge and an operation count counter. The 
> average could then be calculated downstream from this information. This is 
> cumbersome for operators and not backwards compatible. To improve on both of 
> those downsides, we could have it keep the current behavior but 
> _additionally_ emit the total as a counter. The snapshotted average is 
> probably sufficient in the common case (we've been using it for years), and 
> when more guaranteed accuracy is required, the average could be derived from 
> the total and operation count.
> Open to suggestions & input here.
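
The third option in the description, deriving the average downstream from two 
cumulative counters, can be sketched as follows. This is only an illustration 
under stated assumptions: the {{Reading}} type, its field names, and the sample 
numbers are hypothetical and are not part of metrics2.

```java
// Sketch of deriving an interval average from two cumulative counters.
// All names here are hypothetical; nothing below is a metrics2 API.
final class DerivedAverage {

    /** One reading of the two cumulative counters, as seen by a single consumer. */
    static final class Reading {
        final long numOps;   // cumulative operation count, never reset
        final double total;  // cumulative sum of sample values, never reset

        Reading(long numOps, double total) {
            this.numOps = numOps;
            this.total = total;
        }
    }

    /** Average over the interval between two readings taken by the same consumer. */
    static double averageBetween(Reading earlier, Reading later) {
        long ops = later.numOps - earlier.numOps;
        if (ops <= 0) {
            return Double.NaN; // no operations in the interval
        }
        return (later.total - earlier.total) / ops;
    }

    public static void main(String[] args) {
        Reading before = new Reading(100, 5000.0);
        Reading after = new Reading(102, 15010.0); // two more ops worth 10010 in total
        System.out.println(averageBetween(before, after)); // 5005.0
    }
}
```

Because neither counter is cleared on read, each consumer can compute the 
average over its own period regardless of how often other consumers sample.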



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14989) Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) values

2017-11-01 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16235048#comment-16235048
 ] 

Eric Yang commented on HADOOP-14989:


[~xkrogen] On line 233, the test patch compares a double to an int, and val has 
a value of 10.00.  If it did not reset, shouldn't it be 10010?
As far as I know, the JMX mbean calls reset internally, and there is no need to 
call it externally.  However, if multiple people are pulling from JMX, I don't 
know how the reset is managed.  Let me know if I misunderstood the test patch.

{quote}
Sure it is a high watermark, but so are all of the MutableCounter metrics in 
Hadoop. These all rise indefinitely; I fail to see how this situation is any 
different.
{quote}

True, this is less of a concern on 64-bit systems, where overflow happens less 
frequently.
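
For a sense of scale, here is a rough back-of-the-envelope sketch. It assumes 
the total is kept in a signed integer counter and grows by a hypothetical 
10^9 units per second; the thread states neither the type nor the rate, so 
both are assumptions.

```java
// Rough estimate of how long a monotonically increasing counter lasts
// before overflowing. The growth rate used in main is a made-up figure.
final class OverflowEstimate {
    static final double SECONDS_PER_YEAR = 365.25 * 24 * 3600;

    /** Seconds until a counter starting at zero reaches maxValue. */
    static double secondsUntilOverflow(double maxValue, double incrementPerSecond) {
        return maxValue / incrementPerSecond;
    }

    public static void main(String[] args) {
        double rate = 1e9; // hypothetical increments per second

        double intSeconds = secondsUntilOverflow(Integer.MAX_VALUE, rate);
        double longYears = secondsUntilOverflow(Long.MAX_VALUE, rate) / SECONDS_PER_YEAR;

        System.out.println("32-bit: about " + intSeconds + " seconds"); // roughly 2 seconds
        System.out.println("64-bit: about " + longYears + " years");    // roughly 292 years
    }
}
```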


[jira] [Commented] (HADOOP-14989) Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) values

2017-11-01 Thread Erik Krogen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234898#comment-16234898
 ] 

Erik Krogen commented on HADOOP-14989:
--

One more comment:
> Total is a high watermark and it will eventually overflow.
Sure it is a high watermark, but so are all of the {{MutableCounter}} metrics 
in Hadoop. These all rise indefinitely; I fail to see how this situation is any 
different.



[jira] [Commented] (HADOOP-14989) Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) values

2017-11-01 Thread Erik Krogen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16234894#comment-16234894
 ] 

Erik Krogen commented on HADOOP-14989:
--

Thank you for the comments [~eyang]! You actually made me realize I had a bit 
of a misunderstanding after digging into the code further. Let me try again:
* The problem I described is definitely an issue if you specify multiple 
refresh rates. I agree there's not a great way around this issue but I think we 
should, at minimum, put something in the documentation indicating that it is 
not a good idea. Right now the behavior I describe when dealing with 
MutableRate values is not documented and would come as a surprise to an 
operator.
* Specifying only a single refresh rate does not solve the JMX issue. The 
single-point collection of metrics for all sinks occurs in 
{{MetricsSystemImpl}}, specifically {{sampleMetrics()}}, which then passes off 
the single {{MetricsBuffer}} to all sinks. This is great. However, JMX avoids 
the {{MetricsSystemImpl}} code altogether, instead directly calling 
{{getMetrics()}} on each {{MetricsSourceAdapter}}. Thus JMX cache refills can 
destroy metrics values even if you correctly configure only one period. I have 
attached a patch, [^HADOOP-14989.test.patch], which demonstrates this issue - 
it's hacky but it should get the point across.

It seems to me the best way to fix this is to save the output values each time 
{{getMetrics()}} is called and use those for the cache. We can either:
* Call {{updateJmxCache()}} at the end of {{getMetrics()}} with the computed 
values
* Store the return value of {{getMetrics()}} and use it as the input for 
{{updateJmxCache()}} the next time it is called, assuming that value is fresh 
enough.

The second is considerably more complex, but it avoids some of the potential 
performance penalty of the {{updateAttrCache()}} and {{updateInfoCache()}} 
calls, which do create a bunch of objects. Not sure if it would be enough to be 
worth the extra complexity.
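
A minimal sketch of the first option follows. It is a simplified model, not the 
real {{MetricsSourceAdapter}}: the class, the single {{AvgTime}} attribute, and 
the method bodies are hypothetical, and what happens when the cache goes stale 
while no sink is sampling is deliberately left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model: the sampling path refreshes the JMX cache with the
// values it just computed, and JMX reads never trigger a snapshot themselves.
final class CachingAdapterSketch {
    private double total; // sum of samples since the last snapshot
    private long count;   // number of samples since the last snapshot
    private Map<String, Double> jmxCache = Collections.emptyMap();

    synchronized void add(double sample) {
        total += sample;
        count++;
    }

    /** Called by the single sampling path on behalf of all sinks. */
    synchronized Map<String, Double> getMetrics() {
        Map<String, Double> values = new HashMap<>();
        values.put("AvgTime", count == 0 ? 0.0 : total / count);
        total = 0;
        count = 0;
        // Reuse the freshly computed values as the JMX cache contents.
        jmxCache = Collections.unmodifiableMap(values);
        return jmxCache;
    }

    /** Models a JMX attribute read: served from the cache, no snapshot, no reset. */
    synchronized Double getAttribute(String name) {
        return jmxCache.get(name);
    }

    public static void main(String[] args) {
        CachingAdapterSketch adapter = new CachingAdapterSketch();
        adapter.add(10000);
        System.out.println(adapter.getAttribute("AvgTime"));       // null: nothing sampled yet
        adapter.add(10);
        System.out.println(adapter.getMetrics().get("AvgTime"));   // 5005.0
        System.out.println(adapter.getAttribute("AvgTime"));       // 5005.0
    }
}
```

In this model a JMX read between the two samples no longer splits the interval, 
so the sink still sees the full average.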

While digging / testing I also noticed another bug which occurs if you have 
multiple sink periods set; see HADOOP-15008.


[jira] [Commented] (HADOOP-14989) Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) values

2017-10-29 Thread Eric Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16223883#comment-16223883
 ] 

Eric Yang commented on HADOOP-14989:


Hi [~xkrogen],

Can we keep both sinks on the same refresh rate, e.g. 10 seconds?  I would 
not recommend using different refresh rates, because that compares data samples 
taken at different frequencies, and the resulting graphs will not look the 
same.  Total is a high watermark and it will eventually overflow.  This is the 
reason the Hadoop community favored a gauge system: it minimizes compute, and 
the community was interested in monitoring metrics in real time only during the 
development phase.

To produce high-fidelity data samples, the essential information to record is 
the timestamp, the previous count, the current count, and the time passed since 
the last sample (or the refresh rate), but post-processing is more expensive.  
Gauges and averages are only good for measuring the velocity of a metric at a 
point in time.  Most monitoring systems can only handle time precision at the 
second or minute scale.  Hence, MutableRate is heavily dependent on the time 
precision that the downstream consumer can handle.  One important limitation is 
that the JMX cache reset requires the JMX sink to be the last one in the chain, 
with the slowest refresh rate, to avoid accuracy problems like the one you 
described.  The JMX sink should not have a lower refresh rate than FileSink, to 
avoid destroying samples before the data is sent.
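
As a rough illustration of the high-fidelity approach described above, a rate 
can be recovered downstream from two samples that each carry a timestamp and a 
cumulative count. The class, method, and numbers below are hypothetical and 
only sketch the post-processing step.

```java
// Sketch: derive an operations-per-second rate from two counter samples.
// Names and values are illustrative; this is not a metrics2 API.
final class CounterRate {

    /** Rate over the interval between two samples of a cumulative counter. */
    static double perSecond(long prevCount, long curCount, long prevMillis, long curMillis) {
        long elapsedMillis = curMillis - prevMillis;
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("samples must be in increasing time order");
        }
        return (curCount - prevCount) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // 600 operations over a 10 second refresh period.
        System.out.println(perSecond(1200, 1800, 0, 10_000)); // 60.0
    }
}
```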



[jira] [Commented] (HADOOP-14989) Multiple metrics2 sinks (incl JMX) result in inconsistent Mutable(Stat|Rate) values

2017-10-27 Thread Erik Krogen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16222652#comment-16222652
 ] 

Erik Krogen commented on HADOOP-14989:
--

Ping [~aw], [~eyang] based on involvement in initial metrics2 JIRAs 
(HADOOP-6919, HADOOP-6728)
