On Wed, Apr 16, 2014 at 6:42 AM, Dan Andreescu <[email protected]> wrote:

>  So, my latest idea for a solution is to write a python script that will
>>> import the section (last X days) of data from the EventLogging tables that
>>> we're interested in into a temporary sqlite database, then proceed with
>>> removing the upper and lower percentiles of the data, according to any
>>> column grouping that might be necessary. And finally, once the data
>>> preprocessing is done in sqlite, run similar queries as before to export
>>> the mean, standard deviation, etc. for given metrics to tsvs. I think using
>>> sqlite is cleaner than doing the preprocessing on db1047 anyway.
>>>
>>> It's quite an undertaking, it basically means rewriting all our current
>>> SQL => TSV conversion. The ability to use more steps in the conversion
>>> means that we'd be able to have simpler, more readable SQL queries. It
>>> would also be a good opportunity to clean up the giant performance query
>>> with a bazillion JOINS:
>>> https://gitorious.org/analytics/multimedia/source/a949b1c8723c4c41700cedf6e9e48c3866e8b2f4:perf/template.sql
>>> which can actually be divided into several data sources all used in the
>>> same graph.
>>>
>>> Does that sound like a good idea, or is there a simpler solution out
>>> there that someone can think of?
>>>
>>
> Well, I think this sounds like we need to seriously evaluate how people
> are using EventLogging data and provide this sort of analysis as a feature.
>  We'd have to hear from more people but I bet it's the right thing to do
> long term.
>
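(For concreteness, the kind of pipeline proposed above might look like the
stdlib-only sketch below. The table and column names are made up, and it uses
an in-memory database with synthetic rows in place of the real EventLogging
import; a real version would pull the last X days from db1047 and write TSVs.)

```python
import sqlite3
import statistics

# Hypothetical stand-in for rows imported from an EventLogging table.
# 'platform' is the grouping column, 'duration' the metric of interest.
rows = [("desktop", v) for v in range(1, 101)] + \
       [("mobile", v) for v in range(1, 51)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (platform TEXT, duration REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

def trimmed_stats(conn, lower=0.05, upper=0.95):
    """Drop the bottom and top percentiles per group, then summarize."""
    out = {}
    for (platform,) in conn.execute("SELECT DISTINCT platform FROM events"):
        values = sorted(v for (v,) in conn.execute(
            "SELECT duration FROM events WHERE platform = ?", (platform,)))
        lo = int(len(values) * lower)
        hi = int(len(values) * upper)
        trimmed = values[lo:hi]
        out[platform] = (statistics.mean(trimmed), statistics.stdev(trimmed))
    return out

stats = trimmed_stats(conn)
# stats now maps each group to (mean, stdev) of the trimmed data,
# ready to be written out as TSV rows.
```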

Yes, I think so too. We used to pipe the data into a MongoDB instance
(besides being the butt of webscale jokes, MongoDB does have an expressive
query framework) but I wasn't able to convince anyone to use it, so I
quietly decommissioned it at one point. It has been missed by no one.

When I have to produce summary statistics these days, I use Pandas and
IPython, either reading the data directly from db1047 to a DataFrame object
or dumping the table to a CSV file first and using that. (See
<http://pandas.pydata.org/pandas-docs/stable/io.html>)
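(As a taste, here is roughly what that workflow looks like. The CSV content
is invented for the example; in practice you'd point read_csv at a dump of an
EventLogging table.)

```python
import io
import pandas as pd

# Stand-in for a CSV dump of an EventLogging table: ten ordinary
# durations plus one pathological outlier.
csv = io.StringIO("platform,duration\n" + "\n".join(
    "desktop,%d" % d for d in list(range(100, 200, 10)) + [9000]))

df = pd.read_csv(csv)

# Trim the extreme 5% tails before summarizing, so the 9000ms
# outlier doesn't skew the mean.
lo, hi = df["duration"].quantile([0.05, 0.95])
trimmed = df[df["duration"].between(lo, hi)]
mean = trimmed["duration"].mean()
```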

It'd be nice if stat1001 (or whatever) had simple packages in Python and
R (and maybe Julia) that knew about db1047 and the credentials required for
accessing it, and provided some simple API for retrieving EventLogging
datasets. I used to have this set up with IPython (I think Aaron and Dario
used it briefly) but I didn't have the chance to polish it and drive
adoption.
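(To illustrate what I mean, a helper package could be as small as the sketch
below. Everything here is hypothetical -- the function names, the config path,
and the timestamp handling are made up; a real version would hold a MySQL
connection to db1047. The Schema_Revision table-naming convention and the
14-digit timestamps do match how EventLogging stores data.)

```python
import configparser

def read_credentials(path="/etc/eventlogging/my.cnf"):
    """Read client credentials from a .my.cnf-style file (hypothetical path)."""
    cfg = configparser.ConfigParser()
    cfg.read(path)
    return cfg["client"]["user"], cfg["client"]["password"]

def build_query(schema, revision, days=30):
    """Build the SELECT for one EventLogging schema's table.

    EventLogging tables are named Schema_Revision, and timestamps are
    14-digit MediaWiki-style strings.
    """
    table = "%s_%d" % (schema, revision)
    return ("SELECT * FROM %s WHERE timestamp >= "
            "DATE_FORMAT(NOW() - INTERVAL %d DAY, '%%Y%%m%%d%%H%%i%%S')"
            % (table, days))
```

A user would then just call something like build_query("NavigationTiming",
1234, days=7) and feed the result to pandas.read_sql, never touching
credentials or connection details directly.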

The other database backends that I think are worth considering are
RethinkDB (document-oriented, <http://rethinkdb.com/>) and BayesDB
(table-oriented, <http://probcomp.csail.mit.edu/bayesdb/>).

RethinkDB has the nicest query interface of all the document-oriented
databases I know of. See
<http://rethinkdb.com/docs/cookbook/python/#computing-the-average-value-of-a-field>
to get a taste.

Postgres's support for column-oriented data is really interesting, but my
hunch is that it won't provide enough of an edge over MariaDB to win people
over. Everyone is familiar with MariaDB / MySQL's dialect of SQL, and
you're already forced to use it by dint of it being the primary backend of
Wikimedia wikis.

I am still really excited by this, and it remains an open requirement for a
lot of performance analysis work, so if there was interest from the
analytics team, I'd be very happy to collaborate. The pubsub setup of
EventLogging makes it very easy to pipe the data into a new medium without
perturbing the existing setup.
_______________________________________________
Multimedia mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/multimedia
