I'd say quantiles are a great idea for describing data of any distribution.
They're hard (if not impossible) to compute in SQL, though.
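
Outside of SQL they are straightforward, though. Below is a minimal sketch in Python, assuming the raw timings have already been pulled out of the database into a list. The function names and the sample numbers are made up for illustration and are not from any existing tool. It computes a percentile by linear interpolation between ranks (one of several common conventions) alongside the geometric mean, which for log-normal data estimates the median, i.e. the 50th percentile.

```python
import math


def geometric_mean(values):
    """Return exp(mean(log(x))); every value must be strictly positive."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))


def percentile(values, p):
    """Return the p-th percentile (0 <= p <= 100) of values.

    Uses linear interpolation between the two closest ranks, which is
    only one of several common percentile definitions.
    """
    if not values:
        raise ValueError("percentile() needs at least one value")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical image display timings in milliseconds, with one outlier.
timings = [120, 95, 110, 130, 105, 98, 5000, 115, 102, 125]

print("arithmetic mean:", sum(timings) / len(timings))
print("geometric mean: ", geometric_mean(timings))
print("median (p50):   ", percentile(timings, 50))
print("p90:            ", percentile(timings, 90))
```

The single outlier pulls the arithmetic mean far away from the bulk of the values, while the geometric mean and the median stay close to them. The 90th percentile describes the slow tail directly, which is the stat Gergo is asking about.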


On Wed, Apr 16, 2014 at 2:09 PM, Gergo Tisza <[email protected]> wrote:

> On Wed, Apr 16, 2014 at 7:24 AM, Aaron Halfaker
> <[email protected]> wrote:
>
>> It turns out that much of this performance data is log-normally
>> distributed[1].  Log-normal distributions tend to have a hockey stick
>> shape where most of the values are close to zero, but occasionally very
>> large values appear[3].  The mean of a log-normal distribution tends to
>> be sensitive to outliers like the ones you describe.
>>
>> A solution to this problem is to generate a geometric mean[2] instead.
>> One convenient thing about log-normal data is that if you log() it, it
>> becomes normal[4] -- and is not sensitive to outliers in the usual way.
>> Also convenient, geometric means are super easy to generate.  All you need
>> to do is this: (1) pass all of the data through log() (2) pass the logged
>> values through avg() (3) pass the result through exp().
>> The best thing about this is that you can do it in MySQL.
>>
>> For example:
>>
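>> SELECT
>>   country,
>>   -- MySQL's aggregate for the arithmetic mean is avg()
>>   avg(timings) AS regular_mean,
>>   -- geometric mean: log each value, average, then exp the result
>>   exp(avg(log(timings))) AS geometric_mean
>> FROM log.WhateverSchemaYouveGot
>> GROUP BY country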
>>
>>
> Thanks, that sounds super simple!
>
> What about quantiles in general? Even if the outlier issue is solved, we
> planned to have stats like speed of image display in the 90th percentile,
> and that still poses the same SQL problem. Or are quantiles unhelpful for
> lognormal distributions in general?
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
