[ 
https://issues.apache.org/jira/browse/COUCHDB-396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Joseph Davis updated COUCHDB-396:
--------------------------------------

    Attachment: couchdb_stats_aggregator.patch

> Fixing weirdness in couch_stats_aggregator.erl
> ----------------------------------------------
>
>                 Key: COUCHDB-396
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-396
>             Project: CouchDB
>          Issue Type: Improvement
>          Components: Database Core, HTTP Interface
>    Affects Versions: 0.10
>         Environment: trunk
>            Reporter: Paul Joseph Davis
>            Assignee: Paul Joseph Davis
>             Fix For: 0.10
>
>         Attachments: couchdb_stats_aggregator.patch
>
>
> Looking at adding unit tests to the couchdb_stats_aggregator module the other 
> day, I realized it was doing some odd calculations. This is a fairly 
> non-trivial patch, so I figured I'd put it in JIRA and get feedback before 
> applying. This patch does everything the old version does AFAICT, but I'll be 
> adding tests before I consider it complete.
> List of major changes:
> * The old behavior for stats was to integrate incoming values for a time 
> period and then reset the values and start integrating again. That seemed a 
> bit odd so I rewrote things to keep the average and standard deviation for 
> the last N seconds with approximately 1 sample per second.
> * Changed request timing calculations [note below]
> * Sample periods are configurable in the .ini file. A sample period of 0 is a 
> special case that integrates all values since CouchDB booted.
> * Sample descriptions are in the configuration files now.
> * You can request different time periods for the root stats end point.
> * Added a sum to the list of statistics
> * Simplified some of the external API
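The windowed behavior in the list above can be sketched roughly as follows. This is a hypothetical illustration, not the patch's actual Erlang code; the class and method names (WindowedStat, record, tick, summary) are made up for the example.

```python
import math
from collections import deque

class WindowedStat:
    """Keeps roughly one sample per second and reports count, sum, mean,
    and standard deviation over the last `period` seconds. A period of 0
    is a special case that never expires samples, i.e. it aggregates
    everything since boot, mirroring the .ini special case."""

    def __init__(self, period):
        self.period = period
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def record(self, now, value):
        self.samples.append((now, value))
        self._prune(now)

    def tick(self, now):
        # On each clock tick, drop samples that fell out of the window.
        self._prune(now)

    def _prune(self, now):
        if self.period == 0:
            return  # period 0: keep everything since boot
        while self.samples and self.samples[0][0] <= now - self.period:
            self.samples.popleft()

    def summary(self):
        vals = [v for _, v in self.samples]
        n = len(vals)
        if n == 0:
            return {"count": 0, "sum": 0, "mean": 0.0, "stddev": 0.0}
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / n
        return {"count": n, "sum": sum(vals),
                "mean": mean, "stddev": math.sqrt(var)}
```

The deque keeps pruning cheap: expired samples are always at the front, so each tick removes only what has aged out of the window.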
> The biggest change is in how time for requests is calculated. AFAICT, the 
> old way was to accumulate request timings in the stats collector and just 
> add new values as clock ticks went by, like everything else does, which makes 
> sense when counters are reset every time period. In the new way I keep 
> a list of the samples from the last time period, and when I get a clock 
> tick part of the update is to remove the samples that have passed out of the 
> time period. For a variable like request_time this would lead to unbounded 
> storage.
> The new method calculates the average time of all requests in a single 
> clock tick (1s). One thing this loses is visibility when there is a lot of 
> variability within a single clock tick, e.g. your average request time is 
> 100ms, but 10% of your requests are taking 500ms. I've read of people doing 
> the averaging trick while also storing quantile information [1]. There are 
> also algorithms for single-pass quantile estimation and the like [2], so it's 
> possible to do those things in O(N) time. The issue with quantiles is that 
> they'd start breaking the logic of how the collector and aggregators are set 
> up. As it is now, there's basically a one event -> one stat constraint. For 
> the time being I went without quantiles to minimize the impact of the patch.
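The trade-off described above can be shown with a toy example. This is a hypothetical sketch with made-up numbers, not code from the patch; `per_tick_sample` stands in for the per-tick averaging, and `quantile` is a naive exact version of what a streaming estimator would approximate.

```python
def per_tick_sample(request_times_ms):
    """Collapse all request timings seen in one clock tick (1s) into a
    single averaged sample, as the per-tick averaging approach does."""
    return sum(request_times_ms) / len(request_times_ms)

def quantile(samples, q):
    """Naive exact quantile via a full sort. A single-pass streaming
    estimator would approximate this without storing every sample."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(q * len(s)))
    return s[idx]

# Nine fast requests and one slow outlier land in the same tick:
tick = [100] * 9 + [500]
per_tick_sample(tick)   # 140.0 ms: the 500 ms tail is invisible
quantile(tick, 0.9)     # 500: the quantile still exposes the outlier
```

The averaged sample (140 ms) hides that a tenth of the requests were 5x slower, which is exactly the information a stored quantile would preserve.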
> This code will also be on github [3] as I add patches.
> [1] http://code.flickr.com/blog/2008/10/27/counting-timing/
> [2] 
> http://www.slamb.org/svn/repos/trunk/projects/loadtest/benchtools/stats.py 
> (See the QuantileEstimator class)
> [3] http://github.com/davisp/couchdb/tree/stats-patch

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
