[jira] [Commented] (CASSANDRA-12643) Estimated histograms tend to overflow
[ https://issues.apache.org/jira/browse/CASSANDRA-12643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15667389#comment-15667389 ] Paulo Motta commented on CASSANDRA-12643: - bq. I think if we just show the values it would be basically easy to determine why the algorithm overflows. The name is fairly meaningless. If the values are 0,0,0 and that overflows we can feed that into a unit test and reason about what it is supposed to do.

The values alone are fairly meaningless without the metric name. A histogram is just a container of values, so an overflowed histogram will typically indicate a problem with the metric, not with the histogram implementation.

I think the correct fix would be for the histogram accessor, not the histogram implementation itself, to print the overflowed histogram when an exception occurs while retrieving values. However, I understand that this would require reporters to be modified, which is not always easy, so I think it's acceptable to log the histogram values when fetching them, provided that the user enables this for troubleshooting. In order not to pollute {{debug.log}}, we should log this at {{TRACE}}, so operators enable it only when they want to troubleshoot problems with overflowing histograms.

I think we should also take this chance to improve the {{IllegalStateException}} message to include the metric name, so users will know which metric overflowed when they get the exception, and can enable {{TRACE}} on the {{org.apache.cassandra.metrics}} package if they want to investigate further. For this, we probably need to pass the metric name to the histogram constructor.

Another thing: CASSANDRA-11752 changed {{EstimatedHistogramReservoir}} to {{DecayingEstimatedHistogramReservoir}}, so I think you will want to work on that class instead.
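A minimal sketch of passing the metric name to the histogram constructor and dumping the bucket values only when trace-level logging is on. {{NamedHistogram}} and {{checkNotOverflowed}} are hypothetical names, not Cassandra classes; {{java.util.logging}} at {{FINEST}} stands in for the SLF4J {{TRACE}} level so the example has no dependencies, and the bucket bounds are assumed to be strictly increasing.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: a histogram that knows which metric it backs.
// java.util.logging FINEST is used here as a stand-in for SLF4J TRACE.
final class NamedHistogram {
    private static final Logger LOG = Logger.getLogger("org.apache.cassandra.metrics");

    private final String metricName;        // supplied when the metric is registered
    private final long[] offsets;           // upper bound of each regular bucket (strictly increasing)
    private final AtomicLongArray buckets;  // offsets.length regular buckets + 1 overflow bucket

    NamedHistogram(String metricName, long[] offsets) {
        this.metricName = metricName;
        this.offsets = offsets.clone();
        this.buckets = new AtomicLongArray(offsets.length + 1);
    }

    void add(long value) {
        int index = Arrays.binarySearch(offsets, value);
        if (index < 0) index = -index - 1;  // first bound >= value, or the overflow slot
        buckets.incrementAndGet(index);
    }

    long[] snapshot() {
        long[] copy = new long[buckets.length()];
        for (int i = 0; i < copy.length; i++) copy[i] = buckets.get(i);
        return copy;
    }

    void checkNotOverflowed() {
        long[] values = snapshot();
        if (values[values.length - 1] > 0) {
            // Only format the bucket dump when the operator has turned the level on.
            if (LOG.isLoggable(Level.FINEST))
                LOG.finest("Overflowed histogram " + metricName + ": " + Arrays.toString(values));
            throw new IllegalStateException("Unable to compute when histogram overflowed: " + metricName);
        }
    }
}
```

The exception message always carries the name, while the potentially long list of bucket values is only produced on demand, which matches the goal of keeping {{debug.log}} clean.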
> Estimated histograms tend to overflow
> -------------------------------------
>
> Key: CASSANDRA-12643
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12643
> Project: Cassandra
> Issue Type: Bug
> Reporter: Edward Capriolo
> Assignee: Edward Capriolo

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (CASSANDRA-12643) Estimated histograms tend to overflow
[ https://issues.apache.org/jira/browse/CASSANDRA-12643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658633#comment-15658633 ] Edward Capriolo commented on CASSANDRA-12643: - {quote} I think the right fix should actually be on LibratoReporter to print which metric is overflowed when getting an exception {quote} That would be the right fix, but it is not the easy fix. Unfortunately, the abstraction provided by reporters only gives you reportGauge(). You cannot redefine that functionality without changing the metric-reporting classes. That is really the big problem. The code looks like this (and it is not code defined in apache-cassandra):
{noformat}
report() {
  for (Gauge g : gauges) {
    reportGauge(g);
  }
  for (Counter c : counters) {
    reportCounter(c);
  }
}
{noformat}
Most reporters do NOT even try/catch, so throwing any exception generally just causes the reporter to fail, and only some metrics get reported. Nothing anywhere expects any metric to throw an exception or assertion error. I agree it would be ideal to see the name and values, but it's not easy. I think if we just show the values, it would be fairly easy to determine why the algorithm overflows. The name is fairly meaningless. If the values are 0,0,0, we can feed that into a unit test and reason about what it is supposed to do.
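To make the failure mode concrete, here is a self-contained sketch with no metrics-library dependency. {{ReportPass}} and {{Metric}} are hypothetical stand-ins for a scheduled reporter and its registered metrics, not the codahale or Librato API. The unguarded loop mirrors the report() pseudocode in the comment; the guarded loop is what a reporter would need to do to survive one bad metric, which is a change outside apache-cassandra.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for one pass of a scheduled reporter.
final class ReportPass {
    // A metric is modeled as a name plus a value supplier that may throw.
    static final class Metric {
        final String name;
        final Supplier<Object> value;
        Metric(String name, Supplier<Object> value) {
            this.name = name;
            this.value = value;
        }
    }

    // Mirrors the quoted loop: the first exception aborts the whole pass,
    // so every metric after the failing one is never reported.
    static List<String> reportUnguarded(List<Metric> metrics) {
        List<String> sent = new ArrayList<>();
        for (Metric m : metrics) sent.add(m.name + "=" + m.value.get());
        return sent;
    }

    // Defensive variant: a failure costs only that metric, and the recorded
    // error carries the metric name.
    static List<String> reportGuarded(List<Metric> metrics, List<String> errors) {
        List<String> sent = new ArrayList<>();
        for (Metric m : metrics) {
            try {
                sent.add(m.name + "=" + m.value.get());
            } catch (RuntimeException e) {
                errors.add(m.name + ": " + e.getMessage());
            }
        }
        return sent;
    }
}
```

With metrics a, b and c where b throws, the unguarded pass reports nothing at all, while the guarded pass reports a and c and records which metric failed.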
[jira] [Commented] (CASSANDRA-12643) Estimated histograms tend to overflow
[ https://issues.apache.org/jira/browse/CASSANDRA-12643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15658234#comment-15658234 ] Paulo Motta commented on CASSANDRA-12643: - I'm not sure how helpful this will be without the metric name, since you get the histogram values but do not know which metric they refer to. I think the right fix should actually be in {{LibratoReporter}}, to print which metric overflowed when it gets an exception, in addition to the histogram values. With that said, I think we should include the histogram buckets in the {{IllegalStateException}} message, so they are logged in the right context, possibly alongside the metric name, rather than logged blindly without context in the {{EstimatedHistogram}} class. WDYT?
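A sketch of what putting the buckets into the exception message might look like, assuming the histogram keeps its bucket upper bounds in {{offsets}} and one extra trailing overflow slot in {{buckets}}. {{OverflowMessage.overflow}} is a hypothetical helper, not existing Cassandra code; it skips empty buckets so a mostly empty histogram still yields a short message.

```java
import java.util.StringJoiner;

// Hypothetical helper: builds an IllegalStateException whose message carries the
// non-empty buckets, so whoever catches it (e.g. a reporter that knows the metric
// name) can log the values in context.
final class OverflowMessage {
    static IllegalStateException overflow(long[] offsets, long[] buckets) {
        // buckets has offsets.length regular slots plus one trailing overflow slot
        StringJoiner joiner = new StringJoiner(", ", "[", "]");
        for (int i = 0; i < buckets.length; i++) {
            if (buckets[i] == 0) continue;  // skip empty buckets to keep the message short
            String label = i < offsets.length
                    ? "<=" + offsets[i]                      // regular bucket: upper bound
                    : ">" + offsets[offsets.length - 1];     // overflow bucket
            joiner.add(label + ":" + buckets[i]);
        }
        return new IllegalStateException("Unable to compute when histogram overflowed: " + joiner);
    }
}
```

For example, bounds {1, 2, 3} with counts {0, 5, 0, 2} produce the message "Unable to compute when histogram overflowed: [<=2:5, >3:2]".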
[jira] [Commented] (CASSANDRA-12643) Estimated histograms tend to overflow
[ https://issues.apache.org/jira/browse/CASSANDRA-12643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15575478#comment-15575478 ] Edward Capriolo commented on CASSANDRA-12643: - This patch would help administrators understand what overflows and why. It can be used to refine the algorithm to fail less. Comments appreciated.
[jira] [Commented] (CASSANDRA-12643) Estimated histograms tend to overflow
[ https://issues.apache.org/jira/browse/CASSANDRA-12643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494465#comment-15494465 ] Edward Capriolo commented on CASSANDRA-12643: - Also, I wanted to point out something. The EstimatedHistogram is also used in non-reporting cases. If you do a usage search, there are some internal structures that are sized based on it. While in practice it may not be a problem, I am struggling with an "estimator" that throws a RuntimeException. After all, it is an estimate. None of its callers check for or handle this exception. Theoretically, this could cause a process to never complete. I am thinking about two methods: one that always returns data, so that exceptions do not bubble up into the reporter, and possibly a second one that throws a checked/unchecked exception, so that callers can have fallback logic.
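A minimal sketch of the two-method idea, assuming a simplified histogram with a trailing overflow bucket. {{TwoWayPercentile}}, {{percentileClamped}} and {{HistogramOverflowException}} are hypothetical names. The comment leaves open whether the strict variant should throw a checked or unchecked exception; this sketch picks a checked one so callers are forced to write their fallback.

```java
import java.util.Arrays;

// Hypothetical sketch: a lenient percentile for reporters and a strict one
// for internal callers that want to apply their own fallback logic.
final class TwoWayPercentile {
    // Checked, so callers must decide what to do on overflow.
    static final class HistogramOverflowException extends Exception {
        HistogramOverflowException(String message) { super(message); }
    }

    private final long[] offsets; // bucket upper bounds
    private final long[] buckets; // offsets.length regular slots + 1 overflow slot

    TwoWayPercentile(long[] offsets, long[] buckets) {
        this.offsets = offsets.clone();
        this.buckets = buckets.clone();
    }

    // Strict: refuses to answer once anything landed in the overflow slot.
    long percentile(double p) throws HistogramOverflowException {
        if (buckets[buckets.length - 1] > 0)
            throw new HistogramOverflowException("Unable to compute when histogram overflowed");
        return percentileClamped(p);
    }

    // Lenient: never throws. Overflowed values are treated as if they sat in the
    // largest regular bucket, so the answer may under-report the true value.
    long percentileClamped(double p) {
        long total = Arrays.stream(buckets).sum();
        long target = (long) Math.ceil(total * p);
        if (target == 0) return 0;
        long seen = 0;
        for (int i = 0; i < offsets.length; i++) {
            seen += buckets[i];
            if (seen >= target) return offsets[i];
        }
        return offsets[offsets.length - 1]; // target falls in the overflow slot (or p > 1)
    }
}
```

Clamping keeps reporters exporting data, but it is a trade-off rather than a fix: the clamped value is only a lower bound whenever the overflow slot is non-empty.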
[jira] [Commented] (CASSANDRA-12643) Estimated histograms tend to overflow
[ https://issues.apache.org/jira/browse/CASSANDRA-12643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15491695#comment-15491695 ] Edward Capriolo commented on CASSANDRA-12643: - https://github.com/apache/cassandra/compare/trunk...edwardcapriolo:CASSANDRA-12643 It would be nice to know which data it was, but at least seeing the data should help us understand this.
[jira] [Commented] (CASSANDRA-12643) Estimated histograms tend to overflow
[ https://issues.apache.org/jira/browse/CASSANDRA-12643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15491600#comment-15491600 ] Edward Capriolo commented on CASSANDRA-12643: - After enabling a reporter and starting up Cassandra, I observed the following stack trace:
{noformat}
java.lang.IllegalStateException: Unable to compute when histogram overflowed
	at org.apache.cassandra.utils.EstimatedHistogram.percentile(EstimatedHistogram.java:198) ~apache-cassandra-3.0.8.jar:3.0.8
	at org.apache.cassandra.metrics.EstimatedHistogramReservoir$HistogramSnapshot.getValue(EstimatedHistogramReservoir.java:85) ~apache-cassandra-3.0.8.jar:3.0.8
	at com.codahale.metrics.Snapshot.getMedian(Snapshot.java:38) ~metrics-core-3.1.0.jar:3.1.0
	at com.librato.metrics.MetricsLibratoBatch.addSampling(MetricsLibratoBatch.java:144) ~metrics-librato-4.1.2.5.jar:na
	at com.librato.metrics.MetricsLibratoBatch.addHistogram(MetricsLibratoBatch.java:124) ~metrics-librato-4.1.2.5.jar:na
	at com.librato.metrics.LibratoReporter.report(LibratoReporter.java:167) ~metrics-librato-4.1.2.5.jar:na
	at com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) ~metrics-core-3.1.0.jar:3.1.0
	at com.librato.metrics.LibratoReporter.report(LibratoReporter.java:127) ~metrics-librato-4.1.2.5.jar:na
	at com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) metrics-core-3.1.0.jar:3.1.0
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) na:1.8.0_101
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) na:1.8.0_101
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) na:1.8.0_101
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) na:1.8.0_101
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) na:1.8.0_101
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) na:1.8.0_101
	at java.lang.Thread.run(Thread.java:745) na:1.8.0_101
{noformat}
According to the inline documentation, the largest bucket should accommodate 36 seconds. Since the server had only been alive for a few seconds, it seems unlikely that these buckets could have overflowed. The overflow exception bubbles up to the reporter, and this prevents data from being exported. I poked around and do not understand why a fresh server would overflow. I found some nits in the code that I can fix, and maybe someone with more brain power can chime in.
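To see why uptime alone does not rule out an overflow, here is a simplified model loosely based on how {{EstimatedHistogram}} lays out its buckets. The 1.2 growth factor and the single trailing overflow bucket are assumptions of this sketch and should be checked against the actual class; {{MiniEstimatedHistogram}} is a hypothetical name. In this model, overflow depends only on whether any single recorded value exceeded the largest bound, not on how many samples were recorded or for how long.

```java
import java.util.Arrays;

// Simplified model: bucket upper bounds grow by ~20%, and one extra trailing
// bucket counts every value above the largest bound.
final class MiniEstimatedHistogram {
    final long[] offsets;
    final long[] buckets;

    MiniEstimatedHistogram(int bucketCount) {
        offsets = new long[bucketCount];
        long last = 1;
        offsets[0] = last;
        for (int i = 1; i < bucketCount; i++) {
            long next = Math.round(last * 1.2);
            if (next == last) next++; // keep bounds strictly increasing for small values
            offsets[i] = next;
            last = next;
        }
        buckets = new long[bucketCount + 1]; // regular buckets + overflow bucket
    }

    void add(long value) {
        int index = Arrays.binarySearch(offsets, value);
        if (index < 0) index = -index - 1; // insertion point; == offsets.length means overflow
        buckets[index]++;
    }

    long percentile(double p) {
        // A single value in the overflow bucket is enough to refuse an answer.
        if (buckets[buckets.length - 1] > 0)
            throw new IllegalStateException("Unable to compute when histogram overflowed");
        long total = Arrays.stream(buckets).sum();
        long target = (long) Math.ceil(total * p);
        if (target == 0) return 0;
        long seen = 0;
        for (int i = 0; i < offsets.length; i++) {
            seen += buckets[i];
            if (seen >= target) return offsets[i];
        }
        return 0; // only reached for p > 1
    }
}
```

With five buckets the bounds come out as 1, 2, 3, 4, 5; recording a thousand samples of 3 and then a single 6 is enough to make {{percentile}} throw, even though the histogram has barely been used.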