Hi,

here's an update to the stats patch.

Alex & I reworked the internals to make stats usable under high
load and simplified the code significantly in the process. The
outlined API (see the quoted mail below) is implemented.

The JS test suite had to be disabled for now since stats
are only available with a one-second delay and that is hard to
test from the outside. We'll be adding functional tests
on the Erlang level later to make sure we get everything
right.

I did a little testing and it seems to work just fine.

When the EUnit issue clears, I'd like to propose moving the
patch into an SVN branch for integration with trunk. We'd
highly appreciate it if you gave it a shot already.

Cheers
Alex & Jan
--

On 10 Feb 2009, at 16:19, Jan Lehnardt wrote:

Hi,

Alex and I are working on our stats package patch and the last
big issue is the API. It is just exposing a bunch of values by
keys, but as usual, the devil is in the details.

Let me explain.

There are two types of counters. "Hit Counters" record
things like the number of requests; they increase monotonically
each time a request hits CouchDB. This is useful for counting
stuff. Cool.

Then there are "Absolute Value Counters" (for lack of a better
term) that collect absolute values like the number of milliseconds
a request took to complete. To create a meaningful metric out
of this type of counter, we need to create averages. There's little
value in recording each individual request (we could still do that
in the access logs) for monitoring reports. So we keep some
aggregate values: min, max, mean, stddev and count (count being
the number of times this counter was called).
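
To make that concrete, here's a rough sketch in Python of how such an
aggregate can be kept incrementally without storing every sample. This
is not the Erlang code from the patch and all names are made up:

import math

class Aggregate:
    """Running min/max/mean/stddev/count over recorded values."""
    def __init__(self):
        self.count = 0
        self.min = None
        self.max = None
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations (Welford's algorithm)

    def record(self, value):
        self.count += 1
        self.min = value if self.min is None else min(self.min, value)
        self.max = value if self.max is None else max(self.max, value)
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def stddev(self):
        return math.sqrt(self._m2 / self.count) if self.count else 0.0

agg = Aggregate()
for ms in (18, 20, 22):   # e.g. request times in milliseconds
    agg.record(ms)
print(agg.min, agg.max, agg.mean, agg.stddev(), agg.count)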

Complexity++

Say you have a CouchDB running for a month. You change some
things in your app or in CouchDB and you'd like to know how this
affected your response time. To effectively see anything, you'd have
to restart CouchDB (and lose all stats) or wait a month. If you
want to see problems coming up in your monitoring, you need
finer-grained time ranges to look at.

To make this a little more useful, Alex and I introduced time ranges.
These are an additional set of aggregates that get reset every 1, 5
and 15 minutes. This should be familiar to you from server load
averages. You can get the aggregate values for four time ranges:

- Between now and the beginning of time (when CouchDB was
 started).
- Between now and 60 seconds ago.
- Between now and 300 seconds ago.
- Between now and 900 seconds ago.

These ranges are hardcoded now, but they can be made configurable
at a later time.
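
Purely to illustrate the reset behaviour, here is a Python sketch
building on the Aggregate sketched above. Again, this is not the
actual implementation, just the idea: one aggregate per range, with
the 1/5/15 minute ones reset on a timer and range 0 never reset.

import threading

class RangedStats:
    RANGES = (0, 1, 5, 15)  # minutes; 0 = since CouchDB was started

    def __init__(self):
        self.aggregates = {r: Aggregate() for r in self.RANGES}
        for minutes in self.RANGES[1:]:
            self._schedule_reset(minutes)

    def record(self, value):
        # every sample goes into all four aggregates
        for agg in self.aggregates.values():
            agg.record(value)

    def _schedule_reset(self, minutes):
        def reset():
            self.aggregates[minutes] = Aggregate()
            self._schedule_reset(minutes)
        timer = threading.Timer(minutes * 60, reset)
        timer.daemon = True
        timer.start()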

The API would look like this:

GET /_stats/couchdb/request_time

{
  "couchdb": {
    "request_time": {
      "description": "Aggregated request time spent in CouchDB since the beginning of time",
      "min": 20,
      "max": 20,
      "mean": 20,
      "stddev": 20,
      "count": 7,
      "range": 0 // 0 means since day zero.
    }
  }
}

To get the aggregate stats for the last minute:

GET /_stats/couchdb/request_time?range=1

{
  "couchdb": {
    "request_time": {
      "description": "Aggregated request time spent in CouchDB since 1 minute ago",
      "min": 20,
      "max": 20,
      "mean": 20,
      "stddev": 20,
      "count": 7,
      "range": 1 // minute
    }
  }
}

Or, more generically:

GET /_stats/couchdb/request_time?range=$range

{
  "couchdb": {
    "request_time": {
      "description": "Aggregated request time spent in CouchDB since $range minutes ago",
      "min": 20,
      "max": 20,
      "mean": 20,
      "stddev": 20,
      "count": 7,
      "range": $range // minutes
    }
  }
}

This seems reasonable. The actual naming of "range" and other
keys can be changed, as can the description text.


Complexity--

Remember Hit Counters? Yes, strictly speaking, CouchDB shouldn't
collect any averages there since our monitoring solution
would take care of that. But the four time-range counters are
available anyway, so we might as well populate them too. Let's
say every second:

GET /_stats/httpd/requests[?$resolution=[1,5,15]]

{
  "httpd": {
    "requests": {
      "description": "Number of requests per second in the last $resolution minutes",
      "min": 20,
      "max": 20,
      "mean": 20,
      "stddev": 20,
      "count": 7,
      "range": $range // minutes
    }
  }
}

"count" would be the raw counter for the stats and the rest meaningful
aggregates.

"per second" is an arbitrary choice again and can be made configurable, if needed. To know at what frequency stats are collected, there's a new
member in the list of aggregates:

{
  "httpd": {
    "requests": {
      "description": "Number of requests per $frequency seconds in the last $resolution minutes",
      "min": 20,
      "max": 20,
      "mean": 20,
      "stddev": 20,
      "count": 7,
      "range": $range, // minutes
      "frequency": 1 // second
    }
  }
}
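
To make the sampling idea concrete, one more Python sketch (hypothetical
names, not the patch itself): every $frequency seconds the raw hit
counter is sampled and the delta since the last sample is fed into the
ranged aggregates, while "count" stays the raw counter.

import threading

class HitCounter:
    def __init__(self, frequency=1):
        self.count = 0                # raw, monotonically increasing
        self.frequency = frequency    # seconds between samples
        self.stats = RangedStats()    # from the sketch above
        self._last_sample = 0
        self._schedule_sample()

    def hit(self):
        self.count += 1

    def _schedule_sample(self):
        def sample():
            delta = self.count - self._last_sample
            self._last_sample = self.count
            self.stats.record(delta / self.frequency)
            self._schedule_sample()
        timer = threading.Timer(self.frequency, sample)
        timer.daemon = True
        timer.start()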

Alex and I tried a couple of different approaches to get here:
different URLs for the different types of counters and aggregates,
adding members in different places, with and without a description,
and a whole lot more, but we surely haven't seen all permutations.

This solution offers a unified URL format and a human-readable as
well as machine-parseable way to determine what kind of counter
you're dealing with.

To just get all stats you can do a

GET /_stats/

and get a huge JSON object back that includes all of the above for all
resolutions that are currently collected.
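
For example, a consumer could do something like this; just a sketch
that assumes CouchDB's default port 5984 and the URL layout proposed
above:

import json
import urllib.request

# fetch the full stats object
with urllib.request.urlopen("http://127.0.0.1:5984/_stats/") as resp:
    stats = json.load(resp)

# pull out a single value, e.g. the mean request time since startup
# (assuming the all-stats object nests keys the same way as above)
print(stats["couchdb"]["request_time"]["mean"])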

Is there anything that does not make sense or is too complicated?

The goal was to create a simple, minimal API for a minimal set
of useful statistics, and Alex and I hope to have found it by
now. But if you can see how this could be further simplified,
let us know :)

Alex and I are also open to completely different approaches to
getting the data out of CouchDB.

We're looking for a few things in this thread:

- A sanity check to know we're not completely off.
- A summary for you of how we arrived at the current proposal.
- A consensus among d...@-readers on the final API we'd like to implement.

Note that a few of these things are already implemented and
others need to be adjusted depending on feedback here.

Please, feed back,

Cheers
Alex & Jan
--


