(+ Eric)

On Mon, Jun 8, 2015 at 5:42 PM, Toby Negrin <[email protected]> wrote:
> As always, I'd recommend that we go with tech we are familiar with --
> mysql or cassandra. We have a cassandra committer on staff who would be
> able to answer these questions in detail.
>
> -Toby
>
> On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns <[email protected]>
> wrote:
>
>> *This discussion is intended to be a branch of the thread: "[Analytics]
>> Pageview API Status update".*
>>
>> Hi all,
>>
>> We Analytics are trying to *choose a storage technology to keep the
>> pageview data* for analysis.
>>
>> We don't want to get to a final system that covers all our needs yet
>> (there are still things to discuss), but to have something *that
>> implements the current stats.grok.se functionalities* as a first step.
>> This way we can get a better grasp of what our difficulties and
>> limitations will be regarding performance and privacy.
>>
>> The objective of this thread is to *choose 3 storage technologies*. We
>> will later set up and fill each of them with 1 day of test data,
>> evaluate them, and decide which one we will go for.
>>
>> There are 2 blocks of data to be stored:
>>
>> 1. *Cube that represents the number of pageviews broken down by the
>> following dimensions*:
>>    - day/hour (size: 24)
>>    - project (size: 800)
>>    - agent type (size: 2)
>>
>> To test with an initial level of anonymity, all cube cells whose value
>> is less than k=100 have an undefined value. However, to be able to
>> retrieve aggregated values without losing those undefined counts, all
>> combinations of slices and dices are precomputed before anonymization
>> and belong to the cube, too. Like this:
>>
>> dim1, dim2, dim3, ..., dimN, val
>> a,    null, null, ..., null, 15     // pv for dim1=a
>> a,    x,    null, ..., null, 34     // pv for dim1=a & dim2=x
>> a,    x,    1,    ..., null, 27     // pv for dim1=a & dim2=x & dim3=1
>> a,    x,    1,    ..., true, undef  // pv for dim1=a & dim2=x & ... & dimN=true
>>
>> So the size of this dataset would be something between 100M and 200M
>> records per year, I think.
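>>
>> To make the precomputation concrete, here is a rough Python sketch
>> (untested, and the row format is just an assumption on my side) of
>> generating all slice/dice combinations and applying the k threshold:
>>
>> from itertools import combinations
>> from collections import defaultdict
>>
>> K = 100  # anonymity threshold: cells below this become undefined
>>
>> def precompute_cube(rows, dims):
>>     """rows: dicts like {'hour': 17, 'project': 'en.wikipedia',
>>     'agent_type': 'user', 'pv': 3}; dims: list of dimension names,
>>     e.g. ['hour', 'project', 'agent_type']."""
>>     cube = defaultdict(int)
>>     for row in rows:
>>         # Add the row's pageviews to every slice/dice combination,
>>         # i.e. to every subset of dimensions (None = aggregated out).
>>         for r in range(len(dims) + 1):
>>             for subset in combinations(dims, r):
>>                 key = tuple(row[d] if d in subset else None for d in dims)
>>                 cube[key] += row['pv']
>>     # Anonymize *after* aggregating, so the parent totals still
>>     # include the counts of the cells that end up undefined.
>>     return {k: (v if v >= K else None) for k, v in cube.items()}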
>>
>> 2. *Timeseries dataset that stores the number of pageviews per article
>> in time with*:
>>    - maximum resolution: hourly
>>    - diminishing resolution over time is accepted if there are
>>      performance problems
>>
>> article (dialect.project/article), day/hour, value
>> en.wikipedia/Main_page, 2015-01-01 17, 123456
>> en.wiktionary/Bazinga,  2015-01-02 13, 23456
>>
>> It's difficult to calculate the size of that. How many articles do we
>> have? 34M? But not all of them will have pageviews every hour... (I
>> hazard a guess below.)
>>
>> *Note*: I guess we should consider that the storage system will
>> presumably have high-volume batch inserts every hour or so, and queries
>> that will be a lot more frequent but also a lot lighter in data size.
>>
>> And that is that.
>> *So please, feel free to suggest storage technologies, comment, etc!*
>> And if there is any assumption I made that you do not agree with,
>> please comment as well!
>>
>> I will start the thread with 2 suggestions:
>> 1) *PostgreSQL*: Seems to be able to handle the volume of the data and
>> knows how to implement diminishing resolution for timeseries.
>> 2) *Project Voldemort*: As we are denormalizing the cube entirely for
>> anonymity, the db doesn't need to compute aggregations, so it may well
>> be a key-value store (sketched below).
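>>
>> Coming back to the timeseries sizing question above, a back-of-envelope
>> guess (both the article count and the active share are assumptions, not
>> measurements):
>>
>> articles = 34e6           # rough total article count
>> active_share = 0.05       # assume ~5% of articles get views in a given hour
>> hours_per_year = 24 * 365
>> rows_per_year = articles * active_share * hours_per_year
>> print("%.1fB rows/year" % (rows_per_year / 1e9))  # ~14.9B at these guesses
>>
>> Even with these guesses it gets heavy fast, which is why I'd keep the
>> door open to diminishing resolution over time.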
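>>
>> And to illustrate the key-value idea in suggestion 2: since every
>> slice/dice combination is precomputed, each cell can become one key. A
>> minimal sketch (the '*' encoding and the store API are hypothetical):
>>
>> def cube_key(hour=None, project=None, agent_type=None):
>>     # '*' stands for a dimension that was aggregated out (null).
>>     dims = (hour, project, agent_type)
>>     return '|'.join('*' if d is None else str(d) for d in dims)
>>
>> # store.put(cube_key(project='en.wikipedia'), 1234567)
>> # store.put(cube_key(hour=17, project='en.wikipedia'), 56789)
>> # cube_key(hour=17, project='en.wikipedia') -> '17|en.wikipedia|*'
>>
>> So every query becomes a single key lookup, with no server-side
>> aggregation at all.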
>> Cheers!
>>
>> Marcel

--
Gabriel Wicke
Principal Engineer, Wikimedia Foundation

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
