I'd say quantiles are a great way to describe data of any distribution. They're hard, if not impossible, to compute in plain SQL, though.
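
For what it's worth, once the timings are out of the database a quantile is only a few lines. Here is a minimal sketch in Python, assuming the values have already been fetched into a list. The `percentile` helper and the sample numbers are made up for illustration, and it uses the nearest-rank definition, which is only one of several common ones.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the requested percentile in the sorted data.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical image display timings in milliseconds, with one outlier.
timings_ms = [120, 95, 110, 4800, 130, 105, 98, 115, 101, 125]

median = percentile(timings_ms, 50)
p90 = percentile(timings_ms, 90)
print(median, p90)
```

Note that the single 4800 ms outlier does not move the median or the 90th percentile here, which is why quantiles work regardless of how the data is distributed.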
On Wed, Apr 16, 2014 at 2:09 PM, Gergo Tisza <[email protected]> wrote:

> On Wed, Apr 16, 2014 at 7:24 AM, Aaron Halfaker <[email protected]> wrote:
>
>> It turns out that much of this performance data is log-normally
>> distributed[1]. Log-normal distributions tend to have a hockey stick
>> shape where most of the values are close to zero, but occasionally very
>> large values appear[3]. The mean of a log-normal distribution tends
>> to be sensitive to outliers like the ones you describe.
>>
>> A solution to this problem is to generate a geometric mean[2] instead.
>> One convenient thing about log-normal data is that if you log() it, it
>> becomes normal[4] -- and not sensitive to outliers in the usual way. Also
>> convenient, geometric means are super easy to generate. All you need to do
>> is this: (1) pass all of the data through log() (2) pass the logged data
>> through mean() (or avg() -- whatever) (3) pass the result through exp().
>> The best thing about this is that you can do it in MySQL.
>>
>> For example:
>>
>>     SELECT
>>         country,
>>         AVG(timings) AS regular_mean,
>>         EXP(AVG(LOG(timings))) AS geometric_mean
>>     FROM log.WhateverSchemaYouveGot
>>     GROUP BY country
>>
>
> Thanks, that sounds super simple!
>
> What about quantiles in general? Even if the outlier issue is solved, we
> planned to have stats like speed of image display in the 90th percentile,
> and that still poses the same SQL problem. Or are quantiles unhelpful for
> lognormal distributions in general?
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
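
The three steps quoted above (log, then mean, then exp) can also be checked outside the database. This is a rough sketch in Python rather than anything from the thread: the `geometric_mean` helper and the sample timings are hypothetical, and it assumes every timing is strictly positive, since log() is undefined otherwise.

```python
import math
from statistics import mean


def geometric_mean(values):
    """Geometric mean computed as exp(mean(log(x)))."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("all values must be strictly positive")
    logged = [math.log(v) for v in values]  # step 1: log() every value
    return math.exp(mean(logged))           # steps 2 and 3: mean(), then exp()


# Hypothetical timings in milliseconds, with one large outlier.
timings_ms = [120, 95, 110, 4800, 130, 105, 98, 115, 101, 125]

print("regular mean:  ", mean(timings_ms))
print("geometric mean:", geometric_mean(timings_ms))
```

With data like this, the regular mean is pulled far above the typical value by the single outlier, while the geometric mean stays much closer to the bulk of the timings.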

_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
