Sending it to the right Analytics email address...
On Wed, Apr 16, 2014 at 1:22 PM, Gilles Dubuc <[email protected]> wrote:

> Including the analytics team in case they have a magical solution to our problem.
>
> Currently, our graphs display the mean and standard deviation of metrics, as provided in the "mean" and "std" columns of our tsvs, generated from EventLogging data: http://multimedia-metrics.wmflabs.org/dashboards/mmv However, we already see that extreme outliers can make the standard deviation and mean skyrocket and, as a result, make the graphs useless for some metrics. See France, for example, where a single massive value was able to skew the map into making the country look problematic: http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_performance-graphs-tab There's no performance issue with France, but the graph suggests there is because of that one outlier.
>
> Ideally, instead of using the mean for our graphs, we would be using what is called the "trimmed mean", i.e. the mean of all values excluding the upper and lower X percentiles. Unfortunately, MariaDB doesn't provide that as a function, and calculating it in SQL can be surprisingly complicated, especially since we often have to group values by a given column. The best alternative I could come up with so far for our geographical queries was to exclude values that differ from the mean by more than X times the standard deviation. It kind of flattens the mean. It's not ideal, because I think that in the context of our graphs it makes things look like they perform better than they really do.
>
> I think the main issue at the moment is that we're using a shell script to pipe a SQL request directly from db1047 to a tsv file. That limits us to one giant SQL query, and since we don't have the ability to create temporary tables on the log database with the research_prod user, we can't preprocess the data in multiple queries to filter out the upper and lower percentiles. The trimmed mean would be kind of feasible as a single complicated query if it weren't for the GROUP BY: http://stackoverflow.com/a/8909568
>
> So, my latest idea for a solution is to write a python script that imports the slice (last X days) of data we're interested in from the EventLogging tables into a temporary sqlite database, then removes the upper and lower percentiles of the data, according to whatever column grouping is necessary. Finally, once the data preprocessing is done in sqlite, it would run similar queries as before to export the mean, standard deviation, etc. for the given metrics to tsvs. I think using sqlite is cleaner than doing the preprocessing on db1047 anyway.
>
> It's quite an undertaking: it basically means rewriting all our current SQL => TSV conversion. But the ability to use more steps in the conversion means we'd be able to have simpler, more readable SQL queries. It would also be a good opportunity to clean up the giant performance query with a bazillion JOINs: https://gitorious.org/analytics/multimedia/source/a949b1c8723c4c41700cedf6e9e48c3866e8b2f4:perf/template.sql which can actually be divided into several data sources all used in the same graph.
>
> Does that sound like a good idea, or is there a simpler solution out there that someone can think of?
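For concreteness, here is a minimal sketch of the grouped trimmed mean described above, once the rows are out of MariaDB and in plain Python. The (group, value) row layout and the 20% cutoff are made up for the example:

    # Grouped trimmed mean: for each group, drop the upper and lower
    # `percent` of values, then average what remains.
    from collections import defaultdict

    def trimmed_means(rows, percent):
        groups = defaultdict(list)
        for group, value in rows:
            groups[group].append(value)
        means = {}
        for group, values in groups.items():
            values.sort()
            k = int(len(values) * percent / 100)  # count to drop at each end
            kept = values[k:len(values) - k] or values
            means[group] = sum(kept) / len(kept)
        return means

    # One extreme outlier no longer dominates the group's mean.
    rows = [("FR", v) for v in (118, 120, 122, 125, 130, 99999)]
    print(trimmed_means(rows, percent=20))  # {'FR': 124.25}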
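And a sketch of the proposed import-into-sqlite pipeline, under the same caveat: all table, column and file names here are placeholders, and the stand-in INSERT is where the real script would SELECT the last X days of rows from the EventLogging tables on db1047.

    # Pipeline sketch: load rows into a throwaway sqlite database, delete
    # each group's upper and lower percentiles, then export the
    # aggregates to a tsv, as today.
    import csv
    import math
    import sqlite3

    PERCENT = 20  # trim this much off each end (20% only for the tiny demo)

    db = sqlite3.connect(":memory:")  # or a temporary file
    db.execute("CREATE TABLE perf (country TEXT, total REAL)")
    # Stand-in for the import step (real rows would come from db1047).
    db.executemany("INSERT INTO perf VALUES (?, ?)",
                   [("FR", 118), ("FR", 120), ("FR", 122),
                    ("FR", 125), ("FR", 130), ("FR", 99999)])

    # Preprocessing: drop rows outside each group's percentile cutoffs.
    for (country,) in db.execute("SELECT DISTINCT country FROM perf").fetchall():
        values = [v for (v,) in db.execute(
            "SELECT total FROM perf WHERE country = ? ORDER BY total",
            (country,))]
        k = int(len(values) * PERCENT / 100)
        if k:
            lo, hi = values[k], values[-k - 1]
            db.execute("DELETE FROM perf WHERE country = ?"
                       " AND (total < ? OR total > ?)", (country, lo, hi))

    # sqlite has no STDDEV/SQRT aggregate built in, so register a SQRT
    # and compute the population standard deviation from the moments.
    db.create_function("SQRT", 1, lambda x: math.sqrt(max(x, 0.0)))

    # Export: the same kind of aggregate query as before, now against
    # the trimmed data, written out as a tsv.
    with open("geo_network_performance.tsv", "w", newline="") as f:
        out = csv.writer(f, delimiter="\t")
        out.writerow(["country", "mean", "std"])
        for row in db.execute(
                "SELECT country, AVG(total),"
                " SQRT(AVG(total * total) - AVG(total) * AVG(total))"
                " FROM perf GROUP BY country"):
            out.writerow(row)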
_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
