[Analytics] [Technical] Pick storage for pageview cubes

Marcel Ruiz Forns Mon, 08 Jun 2015 16:47:09 -0700

*This discussion is intended to be a branch of the thread: "[Analytics]
Pageview API Status update".*


Hi all,

We Analytics are trying to *choose a storage technology to keep the
pageview data* for analysis.

We don't want to get to a final system that covers all our needs yet (there
are still things to discuss), but have something *that implements the
current stats.grok.se <http://stats.grok.se> functionalities* as a first
step. This way we can have a better grasp of which will be our difficulties
and limitations regarding performance and privacy.

The objective of this thread is to *choose 3 storage technologies*. We will
later setup an fill each of them with 1 day of test data, evaluate them and
decide which one of them we will go for.

There are 2 blocks of data to be stored:

   1. *Cube that represents the number of pageviews broken down by the
   following dimensions*:
      - day/hour (size: 24)
      - project (size: 800)
      - agent type (size: 2)

To test with an initial level of anonymity, all cube cells whose value is
less than k=100 have an undefined value. However, to be able to retrieve
aggregated values without loosing that undefined counts, all combinations
of slices and dices are precomputed before anonymization and belong to the
cube, too. Like this:

dim1,  dim2,  dim3,  ...,  dimN,  val
   a,  null,  null,  ...,  null,   15    // pv for dim1=a
   a,     x,  null,  ...,  null,   34    // pv for dim1=a & dim2=x
   a,     x,     1,  ...,  null,   27    // pv for dim1=a & dim2=x & dim3=1
   a,     x,     1,  ...,  true,  undef  // pv for dim1=a & dim2=x & ... &
dimN=true

So the size of this dataset would be something between 100M and 200M
records per year, I think.


   1. *Timeseries dataset that stores the number of pageviews per article
   in time with*:
      - maximum resolution: hourly
      - diminishing resolution over time is accepted if there are
      performance problems

article (dialect.project/article),       day/hour,   value

           en.wikipedia/Main_page,  2015-01-01 17,  123456

            en.wiktionary/Bazinga,  2015-01-02 13,   23456

It's difficult to calculate the size of that. How many articles do we have?
34M?
But not all of them will have pageviews every hour...



*Note*: I guess we should consider that the storage system will presumably
have high volume batch inserts every hour or so, and queries that will be a
lot more frequent but also a lot lighter in data size.

And that is that.
*So please, feel free to suggest storage technologies, comment, etc!*
And if there is any assumption I made in which you do not agree, please
comment also!

I will start the thread with 2 suggestions:
1) *PostgreSQL*: Seems to be able to handle the volume of the data and
knows how to implement diminishing resolution for timeseries.
2) *Project Voldemort*: As we are denormalizing the cube entirely for
anonymity, the db doesn't need to compute aggregations, so it may well be a
key-value store.

Cheers!

Marcel

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics

[Analytics] [Technical] Pick storage for pageview cubes

Reply via email to