[
https://issues.apache.org/jira/browse/COUCHDB-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Paul Joseph Davis updated COUCHDB-396:
--------------------------------------
Attachment: couchdb_stats_aggregator.patch
> Fixing weirdness in couch_stats_aggregator.erl
> ----------------------------------------------
>
> Key: COUCHDB-396
> URL: https://issues.apache.org/jira/browse/COUCHDB-396
> Project: CouchDB
> Issue Type: Improvement
> Components: Database Core, HTTP Interface
> Affects Versions: 0.10
> Environment: trunk
> Reporter: Paul Joseph Davis
> Assignee: Paul Joseph Davis
> Fix For: 0.10
>
> Attachments: couchdb_stats_aggregator.patch
>
>
> Looking at adding unit tests to the couch_stats_aggregator module the other
> day, I realized it was doing some odd calculations. This is a fairly
> non-trivial patch, so I figured I'd put it in JIRA and get feedback before
> applying. This patch does everything the old version does, AFAICT, but I'll be
> adding tests before I consider it complete.
> List of major changes:
> * The old behavior for stats was to integrate incoming values for a time
> period and then reset the values and start integrating again. That seemed a
> bit odd, so I rewrote things to keep the average and standard deviation for
> the last N seconds, with approximately 1 sample per second.
> * Changed request timing calculations [note below]
> * Sample periods are configurable in the .ini file. A sample period of 0 is a
> special case and integrates all values from CouchDB boot onward.
> * Sample descriptions are in the configuration files now.
> * You can request different time periods for the root stats end point.
> * Added a sum to the list of statistics
> * Simplified some of the external API
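The windowed average/standard-deviation behavior from the list above can be sketched as follows. This is a minimal illustration in Python, not the actual Erlang implementation; the class and method names are mine, and the real module is couch_stats_aggregator.erl.

```python
import math
from collections import deque

class WindowedStat:
    """Sliding-window stats: roughly one aggregated sample per second,
    keeping only the last `period` seconds of samples. A period of 0 is
    the special case that integrates all values since boot."""

    def __init__(self, period):
        # maxlen evicts the oldest sample automatically once the window
        # is full; an unbounded deque models the period == 0 case.
        self.samples = deque() if period == 0 else deque(maxlen=period)

    def tick(self, value):
        # Called once per clock tick (~1s) with that second's value.
        self.samples.append(value)

    def mean(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

    def stddev(self):
        n = len(self.samples)
        if n < 2:
            return 0.0
        m = self.mean()
        return math.sqrt(sum((x - m) ** 2 for x in self.samples) / (n - 1))
```

With a 3-second window, pushing the values 1..4 leaves only the last three samples contributing to the mean and standard deviation.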
> The biggest change is in how request times are calculated. AFAICT, the old
> way accumulated request timings in the stats collector and just added new
> values as clock ticks went by, like everything else does, which makes sense
> when counters are reset every time period. In the new way I'm keeping a list
> of the samples from the last time period, and on each clock tick part of the
> update is to remove the samples that have passed out of the time period. For
> a variable like request_time this would lead to unbounded storage.
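The per-tick eviction described above might look like this. A hedged Python sketch with hypothetical names, not the patch's Erlang code; timestamps are passed explicitly so the behavior is deterministic.

```python
from collections import deque

class SampleWindow:
    """Keeps timestamped samples; on each clock tick, drops the ones
    that have aged out of the window. For a high-volume variable like
    request_time this per-sample storage grows without bound within a
    tick, which is why the patch averages per tick instead."""

    def __init__(self, period_secs):
        self.period = period_secs
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def record(self, value, now):
        self.samples.append((now, value))

    def tick(self, now):
        # Part of the per-tick update: evict expired samples.
        cutoff = now - self.period
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()
```

For example, with a 60-second window, a tick at t=100 evicts everything recorded before t=40.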
> The new method calculates the average time of all requests within a single
> clock tick (1s). One thing this loses is visibility when you have lots of
> variability within a single clock tick. I.e., your average request time is
> 100ms, but 10% of your requests are taking 500ms. I've read of people doing
> the averaging trick while also storing quantile information [1]. There are
> also algorithms for doing single-pass quantile estimation and the like [2],
> so it's possible to do those things in O(N) time. The issue with quantiles is
> that they'd start breaking the logic of how the collector and aggregators are
> set up. As it is now, there's basically a one event -> one stat constraint.
> For the time being I went without quantiles to minimize the impact of the
> patch.
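The per-tick averaging of request times can be sketched like this; a minimal Python illustration with names of my own choosing, not the actual collector code.

```python
class TickAverager:
    """Collapses all request timings seen during one clock tick into a
    single averaged sample, preserving the one event -> one stat
    constraint at the cost of hiding intra-tick variability (e.g. a
    tail of 500ms requests behind a 100ms average)."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def record_request(self, duration_ms):
        # Accumulate within the current tick.
        self.total += duration_ms
        self.count += 1

    def tick(self):
        # Emit exactly one sample per tick: this tick's mean, then reset.
        avg = self.total / self.count if self.count else 0.0
        self.total, self.count = 0.0, 0
        return avg
```

Three requests of 100ms, 100ms, and 500ms within one tick produce a single ~233ms sample, illustrating exactly the loss of tail information discussed above.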
> This code will also be on GitHub [3] as I add patches.
> [1] http://code.flickr.com/blog/2008/10/27/counting-timing/
> [2]
> http://www.slamb.org/svn/repos/trunk/projects/loadtest/benchtools/stats.py
> (See the QuantileEstimator class)
> [3] http://github.com/davisp/couchdb/tree/stats-patch
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.