Nice Aaron!

> On Apr 16, 2014, at 7:26 AM, Aaron Halfaker <[email protected]> wrote:
> 
> The SVGs plots I made don't show up well in gmail, so here's some PNGs
> 
> 
>> On Wed, Apr 16, 2014 at 9:24 AM, Aaron Halfaker <[email protected]> 
>> wrote:
>> Hi Gilles,
>> 
>> I think I know just the thing you're looking for.   
>> 
>> It turns out that much of this performance data is log-normally 
>> distributed[1].    Log-normal distributions tend to have a hockey stick 
>> shape where most of the values are close to zero, but occasionally very 
>> large values appear[3].  Taking the mean of a log-normal distributions tend 
>> to be sensitive to outliers like the ones you describe.  
>> 
>> A solution to this problem is to generate a geometric mean[2] instead.  One 
>> convenient thing about log-normal data is that if you log() it, it becomes 
>> normal[4] -- and not sensitive to outliers in the usual way.  Also 
>> convenient, geometric means are super easy to generate.  All you need to do 
>> is this: (1) pass all of the data through log() (2) pass the same data 
>> through mean() (or avg() -- whatever) (3) pass the result through exp().  
>> The best thing about this is that you can do it in MySQL.
>> 
>> For example:
>> 
>> SELECT
>>   country,
>>   mean(timings) AS regular_mean,
>>   exp(log(mean(timings)) AS geomteric_mean
>> FROM log.WhateverSchemaYouveGot
>> GROUP BY country
>> 
>> 1. https://en.wikipedia.org/wiki/Log-normal_distribution
>> 2. https://en.wikipedia.org/wiki/Geometric_mean
>> 3. See distribution.log_normal.svg (24K)
>> 4. See distribution.log_normal.logged.svg (33K)
>> 
>> -Aaron
>> 
>> On Wed, Apr 16, 2014 at 8:42 AM, Dan Andreescu <[email protected]> 
>> wrote:
>>>>> So, my latest idea for a solution is to write a python script that will 
>>>>> import the section (last X days) of data from the EventLogging tables 
>>>>> that we're interested in into a temporary sqlite database, then proceed 
>>>>> with removing the upper and lower percentiles of the data, according to 
>>>>> any column grouping that might be necessary. And finally, once the data 
>>>>> preprocessing is done in sqlite, run similar queries as before to export 
>>>>> the mean, standard deviation, etc. for given metrics to tsvs. I think 
>>>>> using sqlite is cleaner than doing the preprocessing on db1047 anyway.
>>>>> 
>>>>> It's quite an undertaking, it basically means rewriting all our current 
>>>>> SQL => TSV conversion. The ability to use more steps in the conversion 
>>>>> means that we'd be able to have simpler, more readable SQL queries. It 
>>>>> would also be a good opportunity to clean up the giant performance query 
>>>>> with a bazillion JOINS: 
>>>>> https://gitorious.org/analytics/multimedia/source/a949b1c8723c4c41700cedf6e9e48c3866e8b2f4:perf/template.sql
>>>>>  which can actually be divided into several data sources all used in the 
>>>>> same graph.
>>>>> 
>>>>> Does that sound like a good idea, or is there a simpler solution out 
>>>>> there that someone can think of?
>>> 
>>> Well, I think this sounds like we need to seriously evaluate how people are 
>>> using EventLogging data and provide this sort of analysis as a feature.  
>>> We'd have to hear from more people but I bet it's the right thing to do 
>>> long term.
>>> 
>>> Meanwhile, "simple" is highly subjective here.  If it was me, I'd clean up 
>>> the indentation of that giant SQL query you have, then maybe figure out 
>>> some ways to make it faster, then be happy as a clam.  So if sql-lite is 
>>> the tool you feel happy as a clam with, then that sounds like a great 
>>> solution.  Alternatives would be python, php, etc.  I forgot if pandas was 
>>> allowed where you're working but that's a great python library that would 
>>> make what you're talking about fairly easy.
>>> 
>>> Another thing for us to seriously consider is PostgreSQL.  This has proper 
>>> f-ing temporary tables and supports actual people doing actual work with 
>>> databases.  We could dump data, especially really simple schemas like 
>>> EventLogging, into PostgreSQL for analysis.
>>> 
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
> 
> <distribution.log_normal.logged.png>
> <distribution.log_normal.png>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia

Reply via email to