Sending it to the right Analytics email address...
On Wed, Apr 16, 2014 at 1:22 PM, Gilles Dubuc <[email protected]> wrote:

> Including the analytics team in case they have a magical solution to our problem.
>
> Currently, our graphs display the mean and standard deviation of metrics, as provided in the "mean" and "std" columns of our tsvs, generated from EventLogging data: http://multimedia-metrics.wmflabs.org/dashboards/mmv However, we already see that extreme outliers can make the standard deviation and mean skyrocket and, as a result, make the graphs useless for some metrics. See France, for example, where a single massive value was able to skew the map into making the country look problematic: http://multimedia-metrics.wmflabs.org/dashboards/mmv#geographical_network_performance-graphs-tab There's no performance issue with France, but the graph suggests there is because of that one outlier.
>
> Ideally, instead of using the mean for our graphs, we would be using what is called the "trimmed mean", i.e. the mean of all values excluding the upper and lower X percentiles. Unfortunately, MariaDB doesn't provide that as a function, and calculating it in SQL can be surprisingly complicated, especially since we often have to group values by a given column. The best alternative I could come up with so far for our geographical queries was to exclude values that differ from the mean by more than X times the standard deviation. It kind of flattens the mean. It's not ideal, because I think that in the context of our graphs it makes things look like they perform better than they really do.
>
> I think the main issue at the moment is that we're using a shell script to pipe a SQL request directly from db1047 to a tsv file. That limits us to one giant SQL query, and since we don't have the ability to create temporary tables on the log database with the research_prod user, we can't preprocess the data in multiple queries to filter out the upper and lower percentiles. The trimmed mean would be kind of feasible as a single complicated query if it weren't for the GROUP BY: http://stackoverflow.com/a/8909568
>
> So, my latest idea for a solution is to write a python script that imports the slice (last X days) of data we're interested in from the EventLogging tables into a temporary sqlite database, then removes the upper and lower percentiles of the data, according to whatever column grouping is necessary. Finally, once the data preprocessing is done in sqlite, it would run similar queries as before to export the mean, standard deviation, etc. for the given metrics to tsvs. I think using sqlite is cleaner than doing the preprocessing on db1047 anyway.
>
> It's quite an undertaking: it basically means rewriting all our current SQL => TSV conversion. But the ability to use more steps in the conversion means we'd be able to have simpler, more readable SQL queries. It would also be a good opportunity to clean up the giant performance query with a bazillion JOINs: https://gitorious.org/analytics/multimedia/source/a949b1c8723c4c41700cedf6e9e48c3866e8b2f4:perf/template.sql which can actually be divided into several data sources all used in the same graph.
>
> Does that sound like a good idea, or is there a simpler solution out there that someone can think of?
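For concreteness, here is a minimal sketch of the grouped trimmed mean described above, once the rows are out of MariaDB and in plain Python. The (group, value) row layout and the 20% cutoff are made up for the example:

    # Grouped trimmed mean: for each group, drop the upper and lower
    # `percent` of values, then average what remains.
    from collections import defaultdict

    def trimmed_means(rows, percent):
        groups = defaultdict(list)
        for group, value in rows:
            groups[group].append(value)
        means = {}
        for group, values in groups.items():
            values.sort()
            k = int(len(values) * percent / 100)  # count to drop at each end
            kept = values[k:len(values) - k] or values
            means[group] = sum(kept) / len(kept)
        return means

    # One extreme outlier no longer dominates the group's mean.
    rows = [("FR", v) for v in (118, 120, 122, 125, 130, 99999)]
    print(trimmed_means(rows, percent=20))  # {'FR': 124.25}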
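And a sketch of the proposed import-into-sqlite pipeline, under the same caveat: all table, column and file names here are placeholders, and the stand-in INSERT is where the real script would SELECT the last X days of rows from the EventLogging tables on db1047.

    # Pipeline sketch: load rows into a throwaway sqlite database, delete
    # each group's upper and lower percentiles, then export the
    # aggregates to a tsv, as today.
    import csv
    import math
    import sqlite3

    PERCENT = 20  # trim this much off each end (20% only for the tiny demo)

    db = sqlite3.connect(":memory:")  # or a temporary file
    db.execute("CREATE TABLE perf (country TEXT, total REAL)")
    # Stand-in for the import step (real rows would come from db1047).
    db.executemany("INSERT INTO perf VALUES (?, ?)",
                   [("FR", 118), ("FR", 120), ("FR", 122),
                    ("FR", 125), ("FR", 130), ("FR", 99999)])

    # Preprocessing: drop rows outside each group's percentile cutoffs.
    for (country,) in db.execute("SELECT DISTINCT country FROM perf").fetchall():
        values = [v for (v,) in db.execute(
            "SELECT total FROM perf WHERE country = ? ORDER BY total",
            (country,))]
        k = int(len(values) * PERCENT / 100)
        if k:
            lo, hi = values[k], values[-k - 1]
            db.execute("DELETE FROM perf WHERE country = ?"
                       " AND (total < ? OR total > ?)", (country, lo, hi))

    # sqlite has no STDDEV/SQRT aggregate built in, so register a SQRT
    # and compute the population standard deviation from the moments.
    db.create_function("SQRT", 1, lambda x: math.sqrt(max(x, 0.0)))

    # Export: the same kind of aggregate query as before, now against
    # the trimmed data, written out as a tsv.
    with open("geo_network_performance.tsv", "w", newline="") as f:
        out = csv.writer(f, delimiter="\t")
        out.writerow(["country", "mean", "std"])
        for row in db.execute(
                "SELECT country, AVG(total),"
                " SQRT(AVG(total * total) - AVG(total) * AVG(total))"
                " FROM perf GROUP BY country"):
            out.writerow(row)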
_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
