WMF uses Postgres for some things, no? Or is that just in labs?

On Jun 8, 2015, at 17:42, Toby Negrin <[email protected]> wrote:

> As always, I'd recommend that we go with tech we are familiar with -- mysql
> or cassandra. We have a cassandra committer on staff who would be able to
> answer these questions in detail.
>
> -Toby
>
> On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns <[email protected]
> <mailto:[email protected]>> wrote:
>
>> This discussion is intended to be a branch of the thread: "[Analytics]
>> Pageview API Status update".
>>
>> Hi all,
>>
>> We in Analytics are trying to choose a storage technology to keep the
>> pageview data for analysis.
>>
>> We don't want to get to a final system that covers all our needs yet
>> (there are still things to discuss), but to have something that implements
>> the current stats.grok.se <http://stats.grok.se/> functionalities as a
>> first step. This way we can get a better grasp of what our difficulties
>> and limitations will be regarding performance and privacy.
>>
>> The objective of this thread is to choose 3 storage technologies. We will
>> later set up and fill each of them with 1 day of test data, evaluate
>> them, and decide which one we will go for.
>>
>> There are 2 blocks of data to be stored:
>>
>> 1) A cube that represents the number of pageviews broken down by the
>> following dimensions:
>>    - day/hour (size: 24)
>>    - project (size: 800)
>>    - agent type (size: 2)
>>
>> To test with an initial level of anonymity, all cube cells whose value is
>> less than k=100 have an undefined value. However, to be able to retrieve
>> aggregated values without losing those undefined counts, all combinations
>> of slices and dices are precomputed before anonymization and belong to
>> the cube, too.
>> Like this:
>>
>>   dim1, dim2, dim3, ..., dimN, val
>>   a,    null, null, ..., null, 15     // pv for dim1=a
>>   a,    x,    null, ..., null, 34     // pv for dim1=a & dim2=x
>>   a,    x,    1,    ..., null, 27     // pv for dim1=a & dim2=x & dim3=1
>>   a,    x,    1,    ..., true, undef  // pv for dim1=a & dim2=x & ... & dimN=true
>>
>> So the size of this dataset would be something between 100M and 200M
>> records per year, I think.
>>
>> 2) A timeseries dataset that stores the number of pageviews per article
>> over time, with:
>>    - maximum resolution: hourly
>>    - diminishing resolution over time is accepted if there are
>>      performance problems
>>
>>   article (dialect.project/article), day/hour, value
>>   en.wikipedia/Main_page, 2015-01-01 17, 123456
>>   en.wiktionary/Bazinga,  2015-01-02 13, 23456
>>
>> It's difficult to calculate the size of that. How many articles do we
>> have? 34M? But not all of them will have pageviews every hour...
>>
>> Note: I guess we should consider that the storage system will presumably
>> have high-volume batch inserts every hour or so, and queries that will be
>> a lot more frequent but also a lot lighter in data size.
>>
>> And that is that.
>> So please, feel free to suggest storage technologies, comment, etc.!
>> And if there is any assumption I made with which you do not agree, please
>> comment as well!
>>
>> I will start the thread with 2 suggestions:
>>
>> 1) PostgreSQL: Seems to be able to handle the volume of the data and
>> knows how to implement diminishing resolution for timeseries.
>>
>> 2) Project Voldemort: As we are denormalizing the cube entirely for
>> anonymity, the db doesn't need to compute aggregations, so it may well be
>> a key-value store.
>>
>> Cheers!
>>
>> Marcel
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected] <mailto:[email protected]>
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>> <https://lists.wikimedia.org/mailman/listinfo/analytics>
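The precompute-then-anonymize scheme Marcel describes (materialize every slice/dice roll-up first, then suppress cells under k) can be sketched as below. This is a minimal illustration, not the planned implementation: the dimension values are toy data and the function names are invented here.

```python
from itertools import product

# Toy pageview rows: (hour, project, agent_type) -> count.
# Real dimension sizes are 24 x 800 x 2; this uses a tiny sample.
raw = {
    (17, "en.wikipedia", "user"): 150,
    (17, "en.wikipedia", "spider"): 40,
    (13, "en.wiktionary", "user"): 90,
}

K = 100  # anonymity threshold from the thread


def all_rollups(raw):
    """Precompute every slice/dice: each dimension is either kept
    or rolled up to None (meaning 'aggregated over')."""
    cube = {}
    for dims, count in raw.items():
        for mask in product([True, False], repeat=len(dims)):
            key = tuple(d if keep else None for d, keep in zip(dims, mask))
            cube[key] = cube.get(key, 0) + count
    return cube


def anonymize(cube, k=K):
    """Cells below k become undefined (None) AFTER aggregation,
    so rolled-up totals still include the suppressed counts."""
    return {key: (v if v >= k else None) for key, v in cube.items()}


cube = anonymize(all_rollups(raw))
print(cube[(None, None, None)])             # fully aggregated total: 280
print(cube[(13, "en.wiktionary", "user")])  # under k, suppressed: None
```

Because every combination is denormalized up front, reads never need the database to aggregate, which is the property that makes a plain key-value store a candidate.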
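For the timeseries block, one way to picture the key-value layout that a store like Project Voldemort would need is a flat article-plus-hour-bucket key. This is a hypothetical sketch using a plain dict as a stand-in backend; the key format and function names are assumptions, not Voldemort's actual API.

```python
from datetime import datetime

store = {}  # stand-in for the real key-value backend


def ts_key(article, when):
    # Hour-resolution bucket, matching the thread's maximum resolution.
    # Illustrative key format: "dialect.project/Article|YYYY-MM-DD HH".
    return f"{article}|{when.strftime('%Y-%m-%d %H')}"


def record(article, when, count):
    # Hourly batch insert: accumulate into the article's hour bucket.
    key = ts_key(article, when)
    store[key] = store.get(key, 0) + count


record("en.wikipedia/Main_page", datetime(2015, 1, 1, 17), 123456)
record("en.wiktionary/Bazinga", datetime(2015, 1, 2, 13), 23456)

print(store["en.wikipedia/Main_page|2015-01-01 17"])  # 123456
```

Diminishing resolution over time could then be a periodic job that merges hour buckets into day buckets for old data, which drops the per-hour keys.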
