Re: [Multimedia] [Analytics] Filtering out outliers in data used to generate tsvs

Fabrice Florin Sat, 19 Apr 2014 12:45:06 -0700

Thanks to everyone for this great teamwork!

The updated geographical performance dashboards which Gilles and Mark just 
posted paint a more optimistic picture than before, which is encouraging:
http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_performance-graphs-tab


However, these extremely fast load times do not match what we are hearing from 
our users — or even our own experience on slower connections. Many images still 
take a much longer time to load in practice, as reported by beta users around 
the world, from Brazil to Hungary. 

Can we really assume that the mean image load time in India is 691 
milliseconds? Seems way too fast, based on my experience traveling in Asia a 
few weeks ago — where images could take a very long time to load, if at all. 

As Gergo pointed out, these early results may simply reflect that our first beta 
testers have faster connections than average users. But could bots or other 
traffic also be distorting the results?

I know that we are working next on histograms that will give us a better sense 
of how outliers are performing against average users. Can’t wait for that.

But I am still concerned that this chart may be painting a much rosier picture 
than what’s actually going on in the real world.

Any practical recommendations for addressing this concern? We want to know 
what’s really happening for average users, so we can determine whether or not 
regions with slow connections like India should consider making this feature 
opt-in, rather than opt-out.

Thanks again to you all for helping us gain more clarity on this critical issue 
:)


Fabrice 


On Apr 18, 2014, at 11:16 AM, Gilles Dubuc <[email protected]> wrote:

> Mark deployed the change, the mean and standard deviation on the "Overall 
> network performance" and "Geographical network performance" tabs are now 
> geometric:
> 
> http://multimedia-metrics.wmflabs.org/dashboards/mmv
> 
> These charts and maps now make a lot more sense! Next I'll be working on 
> distribution histograms, so that we can see the outlier values that are now 
> excluded from those graphs.
> 
> Thanks again Aaron, thanks to you these visualizations have become truly 
> useful and meaningful, in the way they were meant to be.
> 
> 
> On Thu, Apr 17, 2014 at 6:13 PM, Aaron Halfaker <[email protected]> 
> wrote:
> Yikes!  Good catch.  
> 
> 
> On Thu, Apr 17, 2014 at 11:12 AM, Gilles Dubuc <[email protected]> wrote:
> A solution to this problem is to generate a geometric mean[2] instead.
> 
> Thanks a lot for the help, it literally instantly solved my problem!
> 
> There was a small mistake in the order of functions in your example, for the 
> record it should be:
> 
> EXP(AVG(LOG(event_total))) AS geometric_mean
> 
> And conveniently the geometric standard deviation can be calculated the same 
> way:
> 
> EXP(STDDEV(LOG(event_total))) AS geometric_stddev
> 
> I put it to the test on a specific set of data where we had a huge outlier, 
> and for that data it seems equivalent to excluding the lower and upper 10 
> percentiles, which is exactly what I was after.
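
For readers who want to try this outside MySQL, here is a minimal Python sketch of the two expressions above next to a 10% trimmed mean. The sample timings, the function names, and the choice of population standard deviation (to mirror MySQL's STDDEV) are assumptions for illustration, not data or code from the dashboards.

```python
import math
import statistics


def geometric_mean(values):
    # Mirrors EXP(AVG(LOG(x))): average the logs, then undo the log.
    return math.exp(statistics.fmean([math.log(v) for v in values]))


def geometric_stddev(values):
    # Mirrors EXP(STDDEV(LOG(x))): population standard deviation of the logs.
    return math.exp(statistics.pstdev([math.log(v) for v in values]))


def trimmed_mean(values, fraction=0.10):
    # Arithmetic mean after dropping the lowest and highest `fraction` of values.
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return statistics.fmean(ordered[k:len(ordered) - k])


# Invented load times in milliseconds, with one huge outlier at the end.
timings = [310, 350, 380, 400, 420, 450, 480, 520, 560, 120000]

arithmetic = statistics.fmean(timings)
geometric = geometric_mean(timings)
trimmed = trimmed_mean(timings)
spread = geometric_stddev(timings)

print(f"arithmetic mean: {arithmetic:.0f} ms")
print(f"geometric mean:  {geometric:.0f} ms")
print(f"trimmed mean:    {trimmed:.0f} ms")
print(f"geometric stddev (multiplicative factor): {spread:.2f}")
```

On this invented sample the geometric mean lands far closer to the trimmed mean than the arithmetic mean does, but how close the two are depends on the data. Note that the geometric standard deviation is a multiplicative factor, not a value in milliseconds.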
> 
> 
> 
> 
> 
> On Wed, Apr 16, 2014 at 4:24 PM, Aaron Halfaker <[email protected]> 
> wrote:
> Hi Gilles,
> 
> I think I know just the thing you're looking for.   
> 
> It turns out that much of this performance data is log-normally 
> distributed[1].    Log-normal distributions tend to have a hockey stick shape 
> where most of the values are close to zero, but occasionally very large 
> values appear[3].  The mean of a log-normal distribution tends to be 
> sensitive to outliers like the ones you describe.  
> 
> A solution to this problem is to generate a geometric mean[2] instead.  One 
> convenient thing about log-normal data is that if you log() it, it becomes 
> normal[4] -- and not sensitive to outliers in the usual way.  Also 
> convenient, geometric means are super easy to generate.  All you need to do 
> is this: (1) pass all of the data through log(), (2) pass the logged data 
> through mean() (or avg() -- whatever), and (3) pass the result through exp().  
> The best thing about this is that you can do it in MySQL.
> 
> For example:
> 
> SELECT
>   country,
>   mean(timings) AS regular_mean,
>   exp(log(mean(timings))) AS geometric_mean
> FROM log.WhateverSchemaYouveGot
> GROUP BY country
> 
> 1. https://en.wikipedia.org/wiki/Log-normal_distribution
> 2. https://en.wikipedia.org/wiki/Geometric_mean
> 3. See distribution.log_normal.svg (24K)
> 4. See distribution.log_normal.logged.svg (33K)
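
The three steps can be sketched in Python as follows, grouped by country like the SQL example. The rows and country codes are invented and the helper name is hypothetical. log() only accepts positive values, so this assumes every timing is greater than zero.

```python
import math
from collections import defaultdict


def geometric_mean(values):
    logged = [math.log(v) for v in values]    # (1) log() every value
    mean_of_logs = sum(logged) / len(logged)  # (2) mean() of the logged values
    return math.exp(mean_of_logs)             # (3) exp() of the result


# Hypothetical (country, load time in ms) rows; the values are made up.
rows = [
    ("IN", 400), ("IN", 500), ("IN", 625), ("IN", 90000),
    ("HU", 200), ("HU", 200), ("HU", 200),
]

# Equivalent of GROUP BY country.
by_country = defaultdict(list)
for country, timing in rows:
    by_country[country].append(timing)

summary = {
    country: {
        "regular_mean": sum(values) / len(values),
        "geometric_mean": geometric_mean(values),
    }
    for country, values in by_country.items()
}

for country, stats in sorted(summary.items()):
    print(country, round(stats["regular_mean"]), round(stats["geometric_mean"]))
```

With a single extreme value in the "IN" group, the regular mean is pulled far above the typical timings while the geometric mean stays much closer to them. A group with identical values gives the same result for both.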
> 
> -Aaron
> 
> On Wed, Apr 16, 2014 at 8:42 AM, Dan Andreescu <[email protected]> 
> wrote:
> So, my latest idea for a solution is to write a python script that will 
> import the section (last X days) of data from the EventLogging tables that 
> we're interested in into a temporary sqlite database, then proceed with 
> removing the upper and lower percentiles of the data, according to any column 
> grouping that might be necessary. And finally, once the data preprocessing is 
> done in sqlite, run similar queries as before to export the mean, standard 
> deviation, etc. for given metrics to tsvs. I think using sqlite is cleaner 
> than doing the preprocessing on db1047 anyway.
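
A rough sketch of that pipeline using only the Python standard library, assuming an in-memory sqlite database stands in for the imported EventLogging slice. The table name, column names, sample values, and the 10% trim fraction are all hypothetical.

```python
import csv
import io
import sqlite3
import statistics
from itertools import groupby


def trim(sorted_values, fraction):
    # Drop the lowest and highest `fraction` of an already sorted list.
    k = int(len(sorted_values) * fraction)
    return sorted_values[k:len(sorted_values) - k]


# Hypothetical stand-in for rows copied out of the EventLogging tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE timings (country TEXT, event_total INTEGER)")
sample = [("IN", v) for v in (5, 300, 320, 340, 360, 380, 400, 420, 440, 99000)]
sample += [("HU", v) for v in (2, 100, 110, 120, 130, 140, 150, 160, 170, 45000)]
conn.executemany("INSERT INTO timings VALUES (?, ?)", sample)

# Sorting in SQL lets groupby() see each country as one contiguous, ordered run.
cursor = conn.execute(
    "SELECT country, event_total FROM timings ORDER BY country, event_total"
)

# Write the per-country summary as tab-separated values.
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["country", "mean", "stddev"])
for country, group in groupby(cursor, key=lambda row: row[0]):
    kept = trim([value for _, value in group], fraction=0.10)
    writer.writerow([country, statistics.fmean(kept), statistics.pstdev(kept)])

tsv = out.getvalue()
print(tsv)
```

Here the trimming happens in Python after a sorted SELECT, which keeps the SQL trivial. Doing the percentile cut inside sqlite itself would be another option.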
> 
> It's quite an undertaking: it basically means rewriting all our current SQL 
> => TSV conversion. The ability to use more steps in the conversion means that 
> we'd be able to have simpler, more readable SQL queries. It would also be a 
> good opportunity to clean up the giant performance query with a bazillion 
> JOINS: 
> https://gitorious.org/analytics/multimedia/source/a949b1c8723c4c41700cedf6e9e48c3866e8b2f4:perf/template.sql
>  which can actually be divided into several data sources all used in the same 
> graph.
> 
> Does that sound like a good idea, or is there a simpler solution out there 
> that someone can think of?
> 
> Well, I think this sounds like we need to seriously evaluate how people are 
> using EventLogging data and provide this sort of analysis as a feature.  We'd 
> have to hear from more people but I bet it's the right thing to do long term.
> 
> Meanwhile, "simple" is highly subjective here.  If it was me, I'd clean up 
> the indentation of that giant SQL query you have, then maybe figure out some 
> ways to make it faster, then be happy as a clam.  So if sqlite is the tool 
> you feel happy as a clam with, then that sounds like a great solution.  
> Alternatives would be python, php, etc.  I forgot if pandas was allowed where 
> you're working but that's a great python library that would make what you're 
> talking about fairly easy.
> 
> Another thing for us to seriously consider is PostgreSQL.  This has proper 
> f-ing temporary tables and supports actual people doing actual work with 
> databases.  We could dump data, especially really simple schemas like 
> EventLogging, into PostgreSQL for analysis.
> 
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
> 
> 
> _______________________________________________
> Multimedia mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/multimedia

_______________________________

Fabrice Florin
Product Manager
Wikimedia Foundation

http://en.wikipedia.org/wiki/User:Fabrice_Florin_(WMF)
