Yikes! Good catch.
On Thu, Apr 17, 2014 at 11:12 AM, Gilles Dubuc <[email protected]> wrote:

>> A solution to this problem is to generate a geometric mean[2] instead.
>
> Thanks a lot for the help, it literally instantly solved my problem!
>
> There was a small mistake in the order of functions in your example; for
> the record, it should be:
>
>     EXP(AVG(LOG(event_total))) AS geometric_mean
>
> And conveniently, the geometric standard deviation can be calculated the
> same way:
>
>     EXP(STDDEV(LOG(event_total))) AS geometric_stddev
>
> I put it to the test on a specific set of data where we had a huge
> outlier, and for that data it seems equivalent to excluding the lower
> and upper 10 percentiles, which is exactly what I was after.
>
> On Wed, Apr 16, 2014 at 4:24 PM, Aaron Halfaker <[email protected]> wrote:
>
>> Hi Gilles,
>>
>> I think I know just the thing you're looking for.
>>
>> It turns out that much of this performance data is log-normally
>> distributed[1]. Log-normal distributions tend to have a hockey-stick
>> shape where most of the values are close to zero, but occasionally very
>> large values appear[3]. The mean of a log-normal distribution tends to
>> be sensitive to outliers like the ones you describe.
>>
>> A solution to this problem is to generate a geometric mean[2] instead.
>> One convenient thing about log-normal data is that if you log() it, it
>> becomes normal[4] -- and not sensitive to outliers in the usual way.
>> Also convenient, geometric means are super easy to generate. All you
>> need to do is this: (1) pass all of the data through log(), (2) pass
>> the logged data through mean() (or avg() -- whatever), (3) pass the
>> result through exp(). The best thing about this is that you can do it
>> in MySQL.
>>
>> For example:
>>
>>     SELECT
>>       country,
>>       mean(timings) AS regular_mean,
>>       exp(log(mean(timings))) AS geometric_mean
>>     FROM log.WhateverSchemaYouveGot
>>     GROUP BY country
>>
>> 1. https://en.wikipedia.org/wiki/Log-normal_distribution
>> 2. https://en.wikipedia.org/wiki/Geometric_mean
>> 3. See attachment distribution.log_normal.svg (24K)
>> 4. See attachment distribution.log_normal.logged.svg (33K)
>>
>> -Aaron
>>
>> On Wed, Apr 16, 2014 at 8:42 AM, Dan Andreescu <[email protected]> wrote:
>>
>>>> So, my latest idea for a solution is to write a python script that
>>>> will import the section (last X days) of data from the EventLogging
>>>> tables that we're interested in into a temporary sqlite database,
>>>> then proceed with removing the upper and lower percentiles of the
>>>> data, according to any column grouping that might be necessary. And
>>>> finally, once the data preprocessing is done in sqlite, run similar
>>>> queries as before to export the mean, standard deviation, etc. for
>>>> given metrics to TSVs. I think using sqlite is cleaner than doing the
>>>> preprocessing on db1047 anyway.
>>>>
>>>> It's quite an undertaking; it basically means rewriting all our
>>>> current SQL => TSV conversion. The ability to use more steps in the
>>>> conversion means that we'd be able to have simpler, more readable SQL
>>>> queries.
>>>> It would also be a good opportunity to clean up the giant performance
>>>> query with a bazillion JOINs:
>>>> https://gitorious.org/analytics/multimedia/source/a949b1c8723c4c41700cedf6e9e48c3866e8b2f4:perf/template.sql
>>>> which can actually be divided into several data sources all used in
>>>> the same graph.
>>>>
>>>> Does that sound like a good idea, or is there a simpler solution out
>>>> there that someone can think of?
>>>
>>> Well, I think this sounds like we need to seriously evaluate how
>>> people are using EventLogging data and provide this sort of analysis
>>> as a feature. We'd have to hear from more people, but I bet it's the
>>> right thing to do long term.
>>>
>>> Meanwhile, "simple" is highly subjective here. If it were me, I'd
>>> clean up the indentation of that giant SQL query you have, then maybe
>>> figure out some ways to make it faster, then be happy as a clam. So if
>>> SQLite is the tool you feel happy as a clam with, then that sounds
>>> like a great solution. Alternatives would be Python, PHP, etc. I
>>> forget if pandas is allowed where you're working, but that's a great
>>> Python library that would make what you're talking about fairly easy.
>>>
>>> Another thing for us to seriously consider is PostgreSQL. It has
>>> proper f-ing temporary tables and supports actual people doing actual
>>> work with databases. We could dump data, especially really simple
>>> schemas like EventLogging, into PostgreSQL for analysis.
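To put numbers on the exchange above, here is a minimal, self-contained Python check of Gilles's observation that the geometric mean behaves much like a 10-percentile trim when a huge outlier is present. The data is simulated (the real event_total values aren't in the thread), and exp(mean(log(x))) is the same computation as the EXP(AVG(LOG(event_total))) expression Gilles gives for MySQL:

    import math
    import random
    import statistics

    # Simulated log-normal "event_total" values plus one pathological
    # outlier; stand-ins for the EventLogging data discussed above.
    random.seed(42)
    values = [random.lognormvariate(3, 1) for _ in range(1000)]
    values.append(1e9)

    # Geometric mean: log everything, average the logs, exponentiate --
    # the same steps as EXP(AVG(LOG(event_total))) in MySQL.
    geometric = math.exp(statistics.mean(math.log(v) for v in values))

    # Trimmed mean: drop the lower and upper 10 percentiles, then average.
    ordered = sorted(values)
    cut = len(ordered) // 10
    trimmed = statistics.mean(ordered[cut:len(ordered) - cut])

    print(f"arithmetic mean: {statistics.mean(values):12.2f}")  # blown up by the outlier
    print(f"geometric mean:  {geometric:12.2f}")                # barely moves
    print(f"trimmed mean:    {trimmed:12.2f}")                  # outlier dropped entirely

On this synthetic data the arithmetic mean lands near a million while the geometric and trimmed means both stay in the tens, which matches what Gilles reports seeing on his data.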
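A sketch of the SQLite pipeline Gilles proposes is below: pull rows into a temporary database, trim the upper and lower 10 percentiles per group, and export the mean and standard deviation of what remains to a TSV. The table and column names (events, country, event_total) and the synthetic rows are assumptions; the real script would SELECT the last X days out of the EventLogging tables instead:

    import csv
    import random
    import sqlite3
    import statistics

    # Stand-in data; a real run would read from the EventLogging tables.
    random.seed(0)
    rows = [(country, random.lognormvariate(3, 1))
            for country in ("FR", "DE", "JP") for _ in range(200)]
    rows.append(("FR", 1e9))  # the kind of outlier being filtered out

    db = sqlite3.connect(":memory:")  # the "temporary sqlite database"
    db.execute("CREATE TABLE events (country TEXT, event_total REAL)")
    db.executemany("INSERT INTO events VALUES (?, ?)", rows)

    countries = [c for (c,) in db.execute("SELECT DISTINCT country FROM events")]
    with open("perf_metrics.tsv", "w", newline="") as f:
        out = csv.writer(f, delimiter="\t")
        out.writerow(["country", "n", "mean", "stddev"])
        for country in countries:
            ordered = [v for (v,) in db.execute(
                "SELECT event_total FROM events"
                " WHERE country = ? ORDER BY event_total", (country,))]
            cut = len(ordered) // 10  # drop lower and upper 10 percentiles
            kept = ordered[cut:len(ordered) - cut]
            out.writerow([country, len(kept),
                          round(statistics.mean(kept), 2),
                          round(statistics.stdev(kept), 2)])

Doing the trim in Python rather than SQL sidesteps the lack of percentile functions in both MySQL and SQLite, which is presumably why the plan calls for a script in the middle of the SQL => TSV conversion.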

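For comparison, the pandas route Dan mentions might look like the following. None of this appears in the thread, so the frame contents and the 10th/90th-percentile cutoffs are only a guess at what he had in mind:

    import pandas as pd

    # Stand-in for whatever the EventLogging query returns.
    df = pd.DataFrame({
        "country": ["FR"] * 5 + ["DE"] * 5,
        "event_total": [90, 100, 110, 120, 3000000,
                        70, 80, 90, 100, 110],
    })

    def trim(group):
        # Keep only values between the 10th and 90th percentiles.
        lo, hi = group.quantile(0.10), group.quantile(0.90)
        return group[(group >= lo) & (group <= hi)]

    stats = (df.groupby("country")["event_total"]
               .apply(trim)
               .groupby(level="country")
               .agg(["mean", "std"]))
    print(stats)

The per-group quantile/filter/aggregate chain is a handful of lines, which supports Dan's point that pandas would make this kind of preprocessing fairly easy.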