WMF uses Postgres for some things, no? Or is that just in labs?

On Jun 8, 2015, at 17:42, Toby Negrin <[email protected]> wrote:

> As always, I'd recommend that we go with tech we are familiar with -- mysql
> or cassandra. We have a cassandra committer on staff who would be able to
> answer these questions in detail.
>
> -Toby
>
> On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns <[email protected]
> <mailto:[email protected]>> wrote:
>
>> This discussion is intended to be a branch of the thread: "[Analytics]
>> Pageview API Status update".
>>
>> Hi all,
>>
>> We in Analytics are trying to choose a storage technology to keep the
>> pageview data for analysis.
>>
>> We don't want to get to a final system that covers all our needs yet
>> (there are still things to discuss), but to have something that implements
>> the current stats.grok.se <http://stats.grok.se/> functionalities as a
>> first step. This way we can get a better grasp of what our difficulties
>> and limitations will be regarding performance and privacy.
>>
>> The objective of this thread is to choose 3 storage technologies. We will
>> later set up and fill each of them with 1 day of test data, evaluate
>> them, and decide which one we will go for.
>>
>> There are 2 blocks of data to be stored:
>>
>> 1) A cube that represents the number of pageviews broken down by the
>> following dimensions:
>>    - day/hour (size: 24)
>>    - project (size: 800)
>>    - agent type (size: 2)
>>
>> To test with an initial level of anonymity, all cube cells whose value is
>> less than k=100 have an undefined value. However, to be able to retrieve
>> aggregated values without losing those undefined counts, all combinations
>> of slices and dices are precomputed before anonymization and belong to
>> the cube, too.
>> Like this:
>>
>>   dim1, dim2, dim3, ..., dimN, val
>>   a,    null, null, ..., null, 15     // pv for dim1=a
>>   a,    x,    null, ..., null, 34     // pv for dim1=a & dim2=x
>>   a,    x,    1,    ..., null, 27     // pv for dim1=a & dim2=x & dim3=1
>>   a,    x,    1,    ..., true, undef  // pv for dim1=a & dim2=x & ... & dimN=true
>>
>> So the size of this dataset would be something between 100M and 200M
>> records per year, I think.
>>
>> 2) A timeseries dataset that stores the number of pageviews per article
>> over time, with:
>>    - maximum resolution: hourly
>>    - diminishing resolution over time is accepted if there are
>>      performance problems
>>
>>   article (dialect.project/article), day/hour, value
>>   en.wikipedia/Main_page, 2015-01-01 17, 123456
>>   en.wiktionary/Bazinga,  2015-01-02 13, 23456
>>
>> It's difficult to calculate the size of that. How many articles do we
>> have? 34M? But not all of them will have pageviews every hour...
>>
>> Note: I guess we should consider that the storage system will presumably
>> have high-volume batch inserts every hour or so, and queries that will be
>> a lot more frequent but also a lot lighter in data size.
>>
>> And that is that.
>> So please, feel free to suggest storage technologies, comment, etc.!
>> And if there is any assumption I made with which you do not agree, please
>> comment as well!
>>
>> I will start the thread with 2 suggestions:
>>
>> 1) PostgreSQL: Seems to be able to handle the volume of the data and
>> knows how to implement diminishing resolution for timeseries.
>>
>> 2) Project Voldemort: As we are denormalizing the cube entirely for
>> anonymity, the db doesn't need to compute aggregations, so it may well be
>> a key-value store.
>>
>> Cheers!
>>
>> Marcel
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected] <mailto:[email protected]>
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>> <https://lists.wikimedia.org/mailman/listinfo/analytics>
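The precompute-then-anonymize scheme Marcel describes (materialize every slice/dice roll-up first, then suppress cells under k) can be sketched as below. This is a minimal illustration, not the planned implementation: the dimension values are toy data and the function names are invented here.

```python
from itertools import product

# Toy pageview rows: (hour, project, agent_type) -> count.
# Real dimension sizes are 24 x 800 x 2; this uses a tiny sample.
raw = {
    (17, "en.wikipedia", "user"): 150,
    (17, "en.wikipedia", "spider"): 40,
    (13, "en.wiktionary", "user"): 90,
}

K = 100  # anonymity threshold from the thread


def all_rollups(raw):
    """Precompute every slice/dice: each dimension is either kept
    or rolled up to None (meaning 'aggregated over')."""
    cube = {}
    for dims, count in raw.items():
        for mask in product([True, False], repeat=len(dims)):
            key = tuple(d if keep else None for d, keep in zip(dims, mask))
            cube[key] = cube.get(key, 0) + count
    return cube


def anonymize(cube, k=K):
    """Cells below k become undefined (None) AFTER aggregation,
    so rolled-up totals still include the suppressed counts."""
    return {key: (v if v >= k else None) for key, v in cube.items()}


cube = anonymize(all_rollups(raw))
print(cube[(None, None, None)])             # fully aggregated total: 280
print(cube[(13, "en.wiktionary", "user")])  # under k, suppressed: None
```

Because every combination is denormalized up front, reads never need the database to aggregate, which is the property that makes a plain key-value store a candidate.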
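For the timeseries block, one way to picture the key-value layout that a store like Project Voldemort would need is a flat article-plus-hour-bucket key. This is a hypothetical sketch using a plain dict as a stand-in backend; the key format and function names are assumptions, not Voldemort's actual API.

```python
from datetime import datetime

store = {}  # stand-in for the real key-value backend


def ts_key(article, when):
    # Hour-resolution bucket, matching the thread's maximum resolution.
    # Illustrative key format: "dialect.project/Article|YYYY-MM-DD HH".
    return f"{article}|{when.strftime('%Y-%m-%d %H')}"


def record(article, when, count):
    # Hourly batch insert: accumulate into the article's hour bucket.
    key = ts_key(article, when)
    store[key] = store.get(key, 0) + count


record("en.wikipedia/Main_page", datetime(2015, 1, 1, 17), 123456)
record("en.wiktionary/Bazinga", datetime(2015, 1, 2, 13), 23456)

print(store["en.wikipedia/Main_page|2015-01-01 17"])  # 123456
```

Diminishing resolution over time could then be a periodic job that merges hour buckets into day buckets for old data, which drops the per-hour keys.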
