On Wed, Apr 16, 2014 at 7:24 AM, Aaron Halfaker <[email protected]> wrote:

> It turns out that much of this performance data is log-normally
> distributed[1].  Log-normal distributions tend to have a hockey stick
> shape where most of the values are close to zero, but occasionally very
> large values appear[3].  The mean of a log-normal distribution tends
> to be sensitive to outliers like the ones you describe.
>
> A solution to this problem is to generate a geometric mean[2] instead.
> One convenient thing about log-normal data is that if you log() it, it
> becomes normal[4] -- and is not sensitive to outliers in the usual way.  Also
> convenient, geometric means are super easy to generate.  All you need to do
> is this: (1) pass all of the data through log() (2) pass the logged data
> through mean() (or avg() -- whatever) (3) pass the result through exp().
> The best thing about this is that you can do it in MySQL.
>
> For example:
>
> SELECT
>   country,
>   AVG(timings) AS regular_mean,
>   EXP(AVG(LOG(timings))) AS geometric_mean
> FROM log.WhateverSchemaYouveGot
> GROUP BY country;
>
>
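To check my understanding of the three steps described above (log, then
mean, then exp), here is a minimal Python sketch. The function name and
the sample timings are made up for illustration, and it assumes every
value is strictly positive, since log() is undefined otherwise.

```python
import math

def geometric_mean(values):
    """Return exp(mean(log(x))) for a list of strictly positive numbers."""
    if not values:
        raise ValueError("need at least one value")
    logs = [math.log(v) for v in values]   # (1) pass the data through log()
    mean_log = sum(logs) / len(logs)       # (2) average the logged data
    return math.exp(mean_log)              # (3) pass the result through exp()

# Hypothetical timings in milliseconds, with one large outlier.
timings = [120.0, 150.0, 180.0, 200.0, 9000.0]
arithmetic = sum(timings) / len(timings)
geometric = geometric_mean(timings)
print(arithmetic, geometric)
```

With this sample the geometric mean is pulled far less toward the outlier
than the regular mean is.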
Thanks, that sounds super simple!

What about quantiles in general? Even if the outlier issue is solved, we
planned to have stats like the 90th percentile of image display speed,
and that still poses the same SQL problem. Or are quantiles unhelpful for
log-normal distributions in general?
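
For concreteness, this is roughly the kind of stat I mean, computed outside
SQL. It is only a sketch: the function name is hypothetical, and the
nearest-rank definition of a percentile is one assumption among several
common ones.

```python
import math

def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: the smallest observed value such that at
    least p percent of the observations are less than or equal to it."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]

# Hypothetical image display timings in milliseconds.
timings = [120.0, 150.0, 180.0, 200.0, 9000.0]
p90 = percentile_nearest_rank(timings, 90)
print(p90)
```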
_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
