On Sat, May 28, 2016 at 21:15:06 +0300, Sergey Fedorov wrote:
> On 25/05/16 04:13, Emilio G. Cota wrote:
> (snip)
> > +double qdist_avg(const struct qdist *dist)
> > +{
> > +    unsigned long count;
> > +    size_t i;
> > +    double ret = 0;
> > +
> > +    count = qdist_sample_count(dist);
> > +    if (!count) {
> > +        return NAN;
> > +    }
> > +    for (i = 0; i < dist->n; i++) {
> > +        struct qdist_entry *e = &dist->entries[i];
> > +
> > +        ret += e->x * e->count / count;
>
> Please use Welford's method or something like that, see
> http://stackoverflow.com/a/1346890.
Yes, the way the mean is computed right now, we might suffer from underflow
if count is huge. But I'd rather take that than the perf penalty of an
iterative method (such as Welford's).

Note that we might have huge numbers of items, e.g. one item per head bucket
in qht's occupancy qdist (and 0.5M head buckets is easy to achieve).

If we were to use an iterative method, we'd need to do something like:

double qdist_avg(const struct qdist *dist)
{
    size_t i, j;
    unsigned long count = 0;
    double ret = 0;

    if (!qdist_sample_count(dist)) {
        return NAN;
    }
    /* compute a running mean to prevent under/overflow */
    for (i = 0; i < dist->n; i++) {
        struct qdist_entry *e = &dist->entries[i];

        for (j = 0; j < e->count; j++) {
            /* fold in one sample at a time: new = old + (x - old) / n */
            ret += (e->x - ret) / ++count;
        }
    }
    return ret;
}

Note that naively skipping the inner loop (i.e. doing a single unweighted
update per bucket) would be incorrect: each of the e->count samples has to
be folded in with its own 1/count weight, otherwise buckets with large
counts end up underrepresented in the mean.

I measured the time it takes to execute qdist_avg(&hst.occupancy) at the
end of booting debian jessie for ARM. The difference is significant:

  Original:  0.000002 s
  Iterative: 0.002846 s

So really I think we should be OK with a potential underflow. If you want
I can add a comment to remind our future selves of these findings.

		Emilio
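
P.S. If we ever did want the incremental form's stability without the
per-sample loop, the correct way to collapse a bucket would be a weighted
update: adding n identical samples of value x to a mean m over count
samples gives m' = m + (x - m) * n / (count + n). A rough, unbenchmarked
sketch along those lines (qdist_avg_weighted is just an illustrative name):

double qdist_avg_weighted(const struct qdist *dist)
{
    size_t i;
    unsigned long count = 0;
    double ret = 0;

    if (!qdist_sample_count(dist)) {
        return NAN;
    }
    for (i = 0; i < dist->n; i++) {
        struct qdist_entry *e = &dist->entries[i];

        /* fold in the whole bucket at once, weighted by its count */
        ret += (e->x - ret) * ((double)e->count / (count + e->count));
        count += e->count;
    }
    return ret;
}

Like the original sum, this does one update per bucket, so it should be
comparably cheap, but I haven't measured it.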