Thanks for the detailed response, Gilles! I appreciate your willingness to keep in mind reports from users alongside the image load data we are collecting.
As you suggest, I will ask legal if we can collect email addresses of users who are willing to be contacted for follow-up questions, so we can dig a bit deeper into their performance issues. I too would rather rely on actual data than anecdotal reports, but I want to make sure that the data is reliable. My own experience continues to show long load times that take seconds, not just milliseconds, on pages like this one: https://hu.wikipedia.org/wiki/Wikip%C3%A9dia:A_nap_k%C3%A9pe#

For the purposes of calculating total image load from your dashboards, should we still be adding the API and image performance numbers? That would bring our different data points a bit closer to each other. :)

I look forward to learning more together about our average users' actual experience, which may require us to calibrate results from different methods until we have a good handle on this.

Onward!

Fabrice

On Apr 21, 2014, at 2:33 AM, Gilles Dubuc <[email protected]> wrote:

> Are the stats reliable though? There is a huge jump a few days ago, even in
> the file page loading times. Is that when it was switched over to CloudBees?
>
> Any data on that graph before March 18th is junk that came from (often
> partial) runs on my laptop, at times on internet connections of very
> questionable quality.
>
> March 18th onwards is exclusively run on CloudBees. You can see right away
> that those CloudBees figures are a lot more stable.
>
> When we have more data in a few days, I'll update the SQL query to remove the
> misleading figures that came from local development. In fact, we should make
> sure to avoid running this test locally against mediawiki.org or any
> production wiki where EventLogging is turned on from now on; otherwise we'll
> pollute the stats.
>
> On Mon, Apr 21, 2014 at 3:38 AM, Gergo Tisza <[email protected]> wrote:
> On Sun, Apr 20, 2014 at 3:39 AM, Gilles Dubuc <[email protected]> wrote:
> Any practical recommendations for addressing this concern?
> Can the users who've been complaining about speed be contacted? That would
> allow us to verify whether the bad experience is consistent for them; we
> could measure it directly and even compare it to their general internet speed.
>
> I started a separate thread about that; will also reach out to the users on
> hu.wiki. Asking for email addresses in the survey would also be good, but we
> should check if it has legal implications (collecting private data can be,
> especially in the EU, a painful process).
>
> And let's not forget that the status quo (opening the File: page) might be
> just as slow for those people. They might just not realize it, because most
> of the time spent loading that page shows you a blank tab. Now that the
> "versus" test has been running on CloudBees for a couple of days, targeting
> mediawiki.org, we can see that the file page is slower on average:
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#media_viewer_vs_file_page-graphs-tab
> That wasn't the case a couple of weeks back, but we've made a number of
> improvements since.
>
> According to those stats, MediaViewer with a warm JS cache beats the file
> page 2 to 1. That's pretty impressive!
>
> Are the stats reliable though? There is a huge jump a few days ago, even in
> the file page loading times. Is that when it was switched over to CloudBees?
>
> _______________________________________________
> Multimedia mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/multimedia

_______________________________

On Apr 20, 2014, at 3:39 AM, Gilles Dubuc <[email protected]> wrote:

> Many images still take a much longer time to load in practice, as reported by
> beta users around the world
>
> Anecdotal evidence doesn't invalidate data collected directly by people's web
> browsers.
> People's impressions aren't as reliable as the data we're measuring. The
> reason we're collecting data this way is so that we can separate the facts
> from the feelings people might have. Since we're talking about an average,
> there are undeniably slower loads for certain people (soon to be shown as
> histograms), but I don't see any reason to doubt the collected averages
> based on people's comments.
>
> For a dozen people who felt the need to comment that it was slow for them,
> there could have been hundreds or thousands who were satisfied and didn't say
> a thing. In my experience, people who are happy or unaffected by something
> are a lot less likely to engage with a feedback survey.
>
> Can we really assume that the mean image load time in India is 691
> milliseconds?
>
> Yes, that data is very real. For the API map, India's figures are calculated
> over 12,209 measured requests from 5,158 unique IP addresses, none of which
> have bot-like user agent strings.
>
> But could there also be some bots or other traffic which could be distorting
> the results?
>
> Bots are a valid concern, so I did some digging. Some bots masquerade as real
> browsers (unlike serious search engines like Google/Yahoo, which make up most
> of the bot traffic), but since we're not seeing any non-masquerading bots at
> all in the India data, I seriously doubt there is any bot traffic at this
> time that would impact the results for that country.
>
> Looking at all countries, I only see 10 hits from a Googlebot user agent
> string, but with such a low count it's hard to say whether it really is
> Googlebot (and not someone/something pretending to be it...). In fact, given
> the low bandwidth on those particular hits (24kb/s on an image load that was
> a Varnish hit) and the fact that their IPs appeared to come from Poland and
> Bangladesh, I doubt it was really Google.
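[Editor's note: the user-agent check described above can be sketched in a few lines of Python. The row shape and field names below are illustrative, not the actual EventLogging schema, and a substring deny-list like this only catches self-identifying bots, not the masquerading ones Gilles mentions.]

```python
import re

# Hypothetical row shape; field names are illustrative, not the real schema.
ROWS = [
    {"country": "IN", "user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/28.0", "event_total": 640},
    {"country": "IN", "user_agent": "Googlebot/2.1 (+http://www.google.com/bot.html)", "event_total": 90},
    {"country": "IN", "user_agent": "Mozilla/5.0 (Windows NT 6.1) Chrome/34.0", "event_total": 730},
]

# Simple deny-list of bot-like user agent substrings. Real bot detection is
# harder, since some bots masquerade as ordinary browsers; this only filters
# the self-identifying ones before computing per-country averages.
BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

def looks_like_bot(user_agent):
    return bool(BOT_PATTERN.search(user_agent))

human_rows = [r for r in ROWS if not looks_like_bot(r["user_agent"])]
print(len(human_rows))  # → 2, the Googlebot hit is excluded
```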
> While it's undeniable that rural areas one might visit while traveling still
> suffer from slow internet, the majority of the world's population now lives
> in cities:
> http://www.un.org/en/development/desa/population/publications/urbanization/urban-rural.shtml
> and the average broadband speed worldwide is probably much higher nowadays
> than most people think: http://www.netindex.com/ And dial-up is rapidly
> disappearing:
> http://www.pewinternet.org/data-trend/internet-use/connection-type/
> Slow internet is a reality for a lot of people, but not for the majority.
> I'm not surprised by the average results we're seeing. I agree that this
> rapid change in recent years can be counter-intuitive when you're used to
> traveling to rural locations.
>
> Any practical recommendations for addressing this concern?
>
> Can the users who've been complaining about speed be contacted? That would
> allow us to verify whether the bad experience is consistent for them; we
> could measure it directly and even compare it to their general internet speed.
>
> As far as performance and stats improvements are concerned, we've been over
> it several times and I think everything that could be done is already
> implemented, filed or on its way.
>
> And let's not forget that the status quo (opening the File: page) might be
> just as slow for those people. They might just not realize it, because most
> of the time spent loading that page shows you a blank tab. Now that the
> "versus" test has been running on CloudBees for a couple of days, targeting
> mediawiki.org, we can see that the file page is slower on average:
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#media_viewer_vs_file_page-graphs-tab
> That wasn't the case a couple of weeks back, but we've made a number of
> improvements since.
>
> That's why I think it's important to do some real measurements on users who
> bring up this issue.
> If we're not already doing it, we should encourage them to optionally enter
> their email address for the purpose of investigating issues further.
>
> On Sat, Apr 19, 2014 at 9:44 PM, Fabrice Florin <[email protected]> wrote:
> Thanks to everyone for this great teamwork!
>
> The updated geographical performance dashboards which Gilles and Mark just
> posted paint a more optimistic picture than before, which is encouraging:
> http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_performance-graphs-tab
>
> However, these extremely fast load times do not match what we are hearing
> from our users, or even our own experience on slower connections. Many
> images still take a much longer time to load in practice, as reported by beta
> users around the world, from Brazil to Hungary.
>
> Can we really assume that the mean image load time in India is 691
> milliseconds? That seems way too fast, based on my experience traveling in
> Asia a few weeks ago, where images could take a very long time to load, if
> they loaded at all.
>
> As Gergo pointed out, these early results may be because our first beta
> testers have faster connections than average users. But could there also be
> some bots or other traffic which could be distorting the results?
>
> I know that we are working next on histograms that will give us a better
> sense of how outliers are performing against average users. Can't wait for
> that.
>
> But I am still concerned that this chart may be painting a much rosier
> picture than what's actually going on in the real world.
>
> Any practical recommendations for addressing this concern? We want to know
> what's really happening for average users, so we can determine whether
> regions with slow connections like India should consider making this feature
> opt-in, rather than opt-out.
> Thanks again to you all for helping us gain more clarity on this critical
> issue :)
>
> Fabrice
>
> On Apr 18, 2014, at 11:16 AM, Gilles Dubuc <[email protected]> wrote:
>
>> Mark deployed the change; the mean and standard deviation on the "Overall
>> network performance" and "Geographical network performance" tabs are now
>> geometric:
>>
>> http://multimedia-metrics.wmflabs.org/dashboards/mmv
>>
>> These charts and maps now make a lot more sense! Next I'll be working on
>> distribution histograms, so that we can see the outlier values that are now
>> excluded from those graphs.
>>
>> Thanks again Aaron -- thanks to you, these visualizations have become truly
>> useful and meaningful, in the way they were meant to be.
>>
>> On Thu, Apr 17, 2014 at 6:13 PM, Aaron Halfaker <[email protected]> wrote:
>> Yikes! Good catch.
>>
>> On Thu, Apr 17, 2014 at 11:12 AM, Gilles Dubuc <[email protected]> wrote:
>> A solution to this problem is to generate a geometric mean[2] instead.
>>
>> Thanks a lot for the help -- it instantly solved my problem!
>>
>> There was a small mistake in the order of functions in your example; for the
>> record, it should be:
>>
>> EXP(AVG(LOG(event_total))) AS geometric_mean
>>
>> And conveniently, the geometric standard deviation can be calculated the
>> same way:
>>
>> EXP(STDDEV(LOG(event_total))) AS geometric_stddev
>>
>> I put it to the test on a specific set of data where we had a huge outlier,
>> and for that data it seems equivalent to excluding the bottom and top 10
>> percent of values, which is exactly what I was after.
>>
>> On Wed, Apr 16, 2014 at 4:24 PM, Aaron Halfaker <[email protected]> wrote:
>> Hi Gilles,
>>
>> I think I know just the thing you're looking for.
>>
>> It turns out that much of this performance data is log-normally
>> distributed[1].
>> Log-normal distributions tend to have a hockey-stick
>> shape where most of the values are close to zero, but occasionally very
>> large values appear[3]. The mean of a log-normal distribution tends to be
>> sensitive to outliers like the ones you describe.
>>
>> A solution to this problem is to generate a geometric mean[2] instead. One
>> convenient thing about log-normal data is that if you log() it, it becomes
>> normal[4] -- and not sensitive to outliers in the usual way. Also
>> convenient, geometric means are super easy to generate. All you need to do
>> is this: (1) pass all of the data through log(), (2) pass the logged data
>> through mean() (or avg() -- whatever), (3) pass the result through exp().
>> The best thing about this is that you can do it in MySQL.
>>
>> For example:
>>
>> SELECT
>>   country,
>>   mean(timings) AS regular_mean,
>>   exp(log(mean(timings))) AS geometric_mean
>> FROM log.WhateverSchemaYouveGot
>> GROUP BY country
>>
>> 1. https://en.wikipedia.org/wiki/Log-normal_distribution
>> 2. https://en.wikipedia.org/wiki/Geometric_mean
>> 3. See distribution.log_normal.svg (24K)
>> 4. See distribution.log_normal.logged.svg (33K)
>>
>> -Aaron
>>
>> On Wed, Apr 16, 2014 at 8:42 AM, Dan Andreescu <[email protected]> wrote:
>> So, my latest idea for a solution is to write a Python script that will
>> import the section (last X days) of data from the EventLogging tables that
>> we're interested in into a temporary SQLite database, then proceed with
>> removing the upper and lower percentiles of the data, according to any
>> column grouping that might be necessary. And finally, once the data
>> preprocessing is done in SQLite, run similar queries as before to export the
>> mean, standard deviation, etc. for given metrics to TSVs. I think using
>> SQLite is cleaner than doing the preprocessing on db1047 anyway.
>>
>> It's quite an undertaking -- it basically means rewriting all our current
>> SQL => TSV conversion.
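[Editor's note: the log/mean/exp recipe quoted above is easy to sanity-check outside of MySQL. The timings below are made up, but they show how a single outlier dominates the arithmetic mean while barely moving the geometric one.]

```python
import math

# Load times in ms, with one huge outlier of the kind described above.
timings = [250, 300, 280, 320, 60000]

arithmetic_mean = sum(timings) / len(timings)

# Geometric mean: log each value, average the logs, exponentiate.
# (Note the order: mean of logs, not log of the mean.)
geometric_mean = math.exp(sum(math.log(t) for t in timings) / len(timings))

print(round(arithmetic_mean))  # → 12230, dominated by the outlier
print(round(geometric_mean))   # → 834, close to the typical values
```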
>> The ability to use more steps in the conversion means
>> that we'd be able to have simpler, more readable SQL queries. It would also
>> be a good opportunity to clean up the giant performance query with a
>> bazillion JOINs:
>> https://gitorious.org/analytics/multimedia/source/a949b1c8723c4c41700cedf6e9e48c3866e8b2f4:perf/template.sql
>> which can actually be divided into several data sources all used in the
>> same graph.
>>
>> Does that sound like a good idea, or is there a simpler solution out there
>> that someone can think of?
>>
>> Well, I think this sounds like we need to seriously evaluate how people are
>> using EventLogging data and provide this sort of analysis as a feature.
>> We'd have to hear from more people, but I bet it's the right thing to do
>> long term.
>>
>> Meanwhile, "simple" is highly subjective here. If it were me, I'd clean up
>> the indentation of that giant SQL query you have, then maybe figure out some
>> ways to make it faster, then be happy as a clam. So if SQLite is the tool
>> you feel happy as a clam with, then that sounds like a great solution.
>> Alternatives would be Python, PHP, etc. I forget whether pandas is allowed
>> where you're working, but that's a great Python library that would make what
>> you're talking about fairly easy.
>>
>> Another thing for us to seriously consider is PostgreSQL. It has proper
>> f-ing temporary tables and supports actual people doing actual work with
>> databases. We could dump data, especially really simple schemas like
>> EventLogging, into PostgreSQL for analysis.
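[Editor's note: the pipeline proposed above (import a slice of EventLogging-style data into a temporary SQLite database, trim the top and bottom percentiles per group, export mean and standard deviation to TSV) can be sketched roughly as follows. Table and column names are illustrative, not the real schema, and the percentile cut is crude on small samples.]

```python
import csv
import sqlite3
import statistics

# Illustrative sample data: (country, event_total in ms), with one outlier.
rows = [
    ("IN", 640), ("IN", 700), ("IN", 660), ("IN", 90000),
    ("HU", 900), ("HU", 950), ("HU", 870), ("HU", 910),
]

# Temporary in-memory SQLite database standing in for the imported slice.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE timings (country TEXT, event_total REAL)")
db.executemany("INSERT INTO timings VALUES (?, ?)", rows)

def trimmed_stats(country, lower=0.1, upper=0.9):
    """Mean and stddev after dropping the bottom/top percentiles per group."""
    values = sorted(
        v for (v,) in db.execute(
            "SELECT event_total FROM timings WHERE country = ?", (country,)
        )
    )
    lo, hi = int(len(values) * lower), int(len(values) * upper)
    kept = values[lo:hi] or values  # keep everything if the slice is empty
    stdev = statistics.pstdev(kept) if len(kept) > 1 else 0.0
    return statistics.mean(kept), stdev

# Export one TSV row per country, as in the existing SQL => TSV conversion.
with open("perf.tsv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["country", "mean", "stddev"])
    for (country,) in db.execute("SELECT DISTINCT country FROM timings"):
        mean, stdev = trimmed_stats(country)
        writer.writerow([country, round(mean, 1), round(stdev, 1)])
```

With the sample data above, the 90,000 ms outlier for "IN" falls outside the kept slice, so the trimmed mean stays near the typical values instead of being dragged up by it.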
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> _______________________________
>
> Fabrice Florin
> Product Manager
> Wikimedia Foundation
>
> http://en.wikipedia.org/wiki/User:Fabrice_Florin_(WMF)
