As always, I'd recommend that we go with tech we are familiar with -- MySQL or Cassandra. We have a Cassandra committer on staff who would be able to answer these questions in detail.
-Toby

On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns <[email protected]> wrote:

> *This discussion is intended to be a branch of the thread: "[Analytics]
> Pageview API Status update".*
>
> Hi all,
>
> We in Analytics are trying to *choose a storage technology to keep the
> pageview data* for analysis.
>
> We don't want to get to a final system that covers all our needs yet
> (there are still things to discuss), but have something *that implements
> the current stats.grok.se <http://stats.grok.se> functionalities* as a
> first step. This way we can have a better grasp of what our difficulties
> and limitations will be regarding performance and privacy.
>
> The objective of this thread is to *choose 3 storage technologies*. We
> will later set up and fill each of them with 1 day of test data, evaluate
> them and decide which one of them we will go for.
>
> There are 2 blocks of data to be stored:
>
> 1. *Cube that represents the number of pageviews broken down by the
>    following dimensions*:
>    - day/hour (size: 24)
>    - project (size: 800)
>    - agent type (size: 2)
>
> To test with an initial level of anonymity, all cube cells whose value is
> less than k=100 have an undefined value. However, to be able to retrieve
> aggregated values without losing those undefined counts, all combinations
> of slices and dices are precomputed before anonymization and belong to the
> cube, too. Like this:
>
>     dim1, dim2, dim3, ..., dimN, val
>     a, null, null, ..., null, 15      // pv for dim1=a
>     a, x, null, ..., null, 34         // pv for dim1=a & dim2=x
>     a, x, 1, ..., null, 27            // pv for dim1=a & dim2=x & dim3=1
>     a, x, 1, ..., true, undef         // pv for dim1=a & dim2=x & ... & dimN=true
>
> So the size of this dataset would be something between 100M and 200M
> records per year, I think.
>
> 2. *Timeseries dataset that stores the number of pageviews per article
>    in time with*:
>    - maximum resolution: hourly
>    - diminishing resolution over time is accepted if there are
>      performance problems
>
>     article (dialect.project/article), day/hour, value
>     en.wikipedia/Main_page, 2015-01-01 17, 123456
>     en.wiktionary/Bazinga, 2015-01-02 13, 23456
>
> It's difficult to calculate the size of that. How many articles do we
> have? 34M? But not all of them will have pageviews every hour...
>
> *Note*: I guess we should consider that the storage system will
> presumably have high-volume batch inserts every hour or so, and queries
> that will be a lot more frequent but also a lot lighter in data size.
>
> And that is that.
> *So please, feel free to suggest storage technologies, comment, etc.!*
> And if there is any assumption I made that you do not agree with, please
> comment as well!
>
> I will start the thread with 2 suggestions:
> 1) *PostgreSQL*: Seems able to handle the volume of the data, and
> diminishing resolution for timeseries can be implemented on it.
> 2) *Project Voldemort*: As we are denormalizing the cube entirely for
> anonymity, the db doesn't need to compute aggregations, so it may well be
> a key-value store.
>
> Cheers!
>
> Marcel
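A minimal sketch (in Python, with made-up sample data and function names, not part of the original proposal) of the slice-and-dice precomputation and k=100 anonymization described above: every raw cell is first added into all of its roll-ups, and the threshold is applied only afterwards, so aggregated values still include the suppressed counts.

    from collections import defaultdict
    from itertools import product

    K = 100  # anonymity threshold from the proposal above

    # Hypothetical raw hourly pageview counts keyed by (hour, project, agent_type).
    raw_counts = {
        (17, "en.wikipedia", "user"): 123456,
        (17, "en.wikipedia", "spider"): 42,   # below k, will end up undefined
        (13, "en.wiktionary", "user"): 23456,
    }

    def precompute_cube(raw, k=K):
        """Aggregate over every combination of dimensions (None = rolled up),
        then hide any cell whose count is below k."""
        cube = defaultdict(int)
        for dims, count in raw.items():
            # Each dimension is either kept or replaced by None (rolled up),
            # so every cell contributes to all of its slices and dices.
            for mask in product((True, False), repeat=len(dims)):
                key = tuple(d if keep else None for d, keep in zip(dims, mask))
                cube[key] += count
        # Apply the threshold only after aggregation, so roll-ups still
        # include the counts of the suppressed cells.
        return {key: (val if val >= k else None) for key, val in cube.items()}

    for key, val in sorted(precompute_cube(raw_counts).items(), key=str):
        print(key, "undef" if val is None else val)

Running it prints the fully rolled-up grand total alongside every partial slice, with any cell whose count falls below 100 reported as undef. The fully denormalized output is also why a key-value store such as Project Voldemort is on the table: each precomputed key maps to a single value, with no aggregation needed at query time.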
_______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
