Yikes! Good catch.
On Thu, Apr 17, 2014 at 11:12 AM, Gilles Dubuc <[email protected]> wrote:

>> A solution to this problem is to generate a geometric mean[2] instead.
>
> Thanks a lot for the help, it literally instantly solved my problem!
>
> There was a small mistake in the order of functions in your example; for
> the record, it should be:
>
>     EXP(AVG(LOG(event_total))) AS geometric_mean
>
> And conveniently, the geometric standard deviation can be calculated the
> same way:
>
>     EXP(STDDEV(LOG(event_total))) AS geometric_stddev
>
> I put it to the test on a specific set of data where we had a huge
> outlier, and for that data it seems equivalent to excluding the lower
> and upper 10 percentiles, which is exactly what I was after.
>
> On Wed, Apr 16, 2014 at 4:24 PM, Aaron Halfaker <[email protected]> wrote:
>
>> Hi Gilles,
>>
>> I think I know just the thing you're looking for.
>>
>> It turns out that much of this performance data is log-normally
>> distributed[1]. Log-normal distributions tend to have a hockey-stick
>> shape where most of the values are close to zero, but occasionally very
>> large values appear[3]. The mean of a log-normal distribution tends to
>> be sensitive to outliers like the ones you describe.
>>
>> A solution to this problem is to generate a geometric mean[2] instead.
>> One convenient thing about log-normal data is that if you log() it, it
>> becomes normal[4] -- and not sensitive to outliers in the usual way.
>> Also convenient, geometric means are super easy to generate. All you
>> need to do is this: (1) pass all of the data through log(), (2) pass
>> the logged data through mean() (or avg() -- whatever), (3) pass the
>> result through exp(). The best thing about this is that you can do it
>> in MySQL.
>>
>> For example:
>>
>>     SELECT
>>       country,
>>       mean(timings) AS regular_mean,
>>       exp(log(mean(timings))) AS geometric_mean
>>     FROM log.WhateverSchemaYouveGot
>>     GROUP BY country
>>
>> 1. https://en.wikipedia.org/wiki/Log-normal_distribution
>> 2. https://en.wikipedia.org/wiki/Geometric_mean
>> 3. See attachment distribution.log_normal.svg (24K)
>> 4. See attachment distribution.log_normal.logged.svg (33K)
>>
>> -Aaron
>>
>> On Wed, Apr 16, 2014 at 8:42 AM, Dan Andreescu <[email protected]> wrote:
>>
>>>> So, my latest idea for a solution is to write a python script that
>>>> will import the section (last X days) of data from the EventLogging
>>>> tables that we're interested in into a temporary sqlite database,
>>>> then proceed with removing the upper and lower percentiles of the
>>>> data, according to any column grouping that might be necessary. And
>>>> finally, once the data preprocessing is done in sqlite, run similar
>>>> queries as before to export the mean, standard deviation, etc. for
>>>> given metrics to TSVs. I think using sqlite is cleaner than doing the
>>>> preprocessing on db1047 anyway.
>>>>
>>>> It's quite an undertaking; it basically means rewriting all our
>>>> current SQL => TSV conversion. The ability to use more steps in the
>>>> conversion means that we'd be able to have simpler, more readable SQL
>>>> queries.
>>>> It would also be a good opportunity to clean up the giant performance
>>>> query with a bazillion JOINs:
>>>> https://gitorious.org/analytics/multimedia/source/a949b1c8723c4c41700cedf6e9e48c3866e8b2f4:perf/template.sql
>>>> which can actually be divided into several data sources all used in
>>>> the same graph.
>>>>
>>>> Does that sound like a good idea, or is there a simpler solution out
>>>> there that someone can think of?
>>>
>>> Well, I think this sounds like we need to seriously evaluate how
>>> people are using EventLogging data and provide this sort of analysis
>>> as a feature. We'd have to hear from more people, but I bet it's the
>>> right thing to do long term.
>>>
>>> Meanwhile, "simple" is highly subjective here. If it were me, I'd
>>> clean up the indentation of that giant SQL query you have, then maybe
>>> figure out some ways to make it faster, then be happy as a clam. So if
>>> SQLite is the tool you feel happy as a clam with, then that sounds
>>> like a great solution. Alternatives would be Python, PHP, etc. I
>>> forget if pandas is allowed where you're working, but that's a great
>>> Python library that would make what you're talking about fairly easy.
>>>
>>> Another thing for us to seriously consider is PostgreSQL. It has
>>> proper f-ing temporary tables and supports actual people doing actual
>>> work with databases. We could dump data, especially really simple
>>> schemas like EventLogging, into PostgreSQL for analysis.
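To put numbers on the exchange above, here is a minimal, self-contained Python check of Gilles's observation that the geometric mean behaves much like a 10-percentile trim when a huge outlier is present. The data is simulated (the real event_total values aren't in the thread), and exp(mean(log(x))) is the same computation as the EXP(AVG(LOG(event_total))) expression Gilles gives for MySQL:

    import math
    import random
    import statistics

    # Simulated log-normal "event_total" values plus one pathological
    # outlier; stand-ins for the EventLogging data discussed above.
    random.seed(42)
    values = [random.lognormvariate(3, 1) for _ in range(1000)]
    values.append(1e9)

    # Geometric mean: log everything, average the logs, exponentiate --
    # the same steps as EXP(AVG(LOG(event_total))) in MySQL.
    geometric = math.exp(statistics.mean(math.log(v) for v in values))

    # Trimmed mean: drop the lower and upper 10 percentiles, then average.
    ordered = sorted(values)
    cut = len(ordered) // 10
    trimmed = statistics.mean(ordered[cut:len(ordered) - cut])

    print(f"arithmetic mean: {statistics.mean(values):12.2f}")  # blown up by the outlier
    print(f"geometric mean:  {geometric:12.2f}")                # barely moves
    print(f"trimmed mean:    {trimmed:12.2f}")                  # outlier dropped entirely

On this synthetic data the arithmetic mean lands near a million while the geometric and trimmed means both stay in the tens, which matches what Gilles reports seeing on his data.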
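A sketch of the SQLite pipeline Gilles proposes is below: pull rows into a temporary database, trim the upper and lower 10 percentiles per group, and export the mean and standard deviation of what remains to a TSV. The table and column names (events, country, event_total) and the synthetic rows are assumptions; the real script would SELECT the last X days out of the EventLogging tables instead:

    import csv
    import random
    import sqlite3
    import statistics

    # Stand-in data; a real run would read from the EventLogging tables.
    random.seed(0)
    rows = [(country, random.lognormvariate(3, 1))
            for country in ("FR", "DE", "JP") for _ in range(200)]
    rows.append(("FR", 1e9))  # the kind of outlier being filtered out

    db = sqlite3.connect(":memory:")  # the "temporary sqlite database"
    db.execute("CREATE TABLE events (country TEXT, event_total REAL)")
    db.executemany("INSERT INTO events VALUES (?, ?)", rows)

    countries = [c for (c,) in db.execute("SELECT DISTINCT country FROM events")]
    with open("perf_metrics.tsv", "w", newline="") as f:
        out = csv.writer(f, delimiter="\t")
        out.writerow(["country", "n", "mean", "stddev"])
        for country in countries:
            ordered = [v for (v,) in db.execute(
                "SELECT event_total FROM events"
                " WHERE country = ? ORDER BY event_total", (country,))]
            cut = len(ordered) // 10  # drop lower and upper 10 percentiles
            kept = ordered[cut:len(ordered) - cut]
            out.writerow([country, len(kept),
                          round(statistics.mean(kept), 2),
                          round(statistics.stdev(kept), 2)])

Doing the trim in Python rather than SQL sidesteps the lack of percentile functions in both MySQL and SQLite, which is presumably why the plan calls for a script in the middle of the SQL => TSV conversion.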

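For comparison, the pandas route Dan mentions might look like the following. None of this appears in the thread, so the frame contents and the 10th/90th-percentile cutoffs are only a guess at what he had in mind:

    import pandas as pd

    # Stand-in for whatever the EventLogging query returns.
    df = pd.DataFrame({
        "country": ["FR"] * 5 + ["DE"] * 5,
        "event_total": [90, 100, 110, 120, 3000000,
                        70, 80, 90, 100, 110],
    })

    def trim(group):
        # Keep only values between the 10th and 90th percentiles.
        lo, hi = group.quantile(0.10), group.quantile(0.90)
        return group[(group >= lo) & (group <= hi)]

    stats = (df.groupby("country")["event_total"]
               .apply(trim)
               .groupby(level="country")
               .agg(["mean", "std"]))
    print(stats)

The per-group quantile/filter/aggregate chain is a handful of lines, which supports Dan's point that pandas would make this kind of preprocessing fairly easy.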