As always, I'd recommend that we go with tech we're familiar with -- MySQL
or Cassandra. We have a Cassandra committer on staff who would be able to
answer these questions in detail.

-Toby

On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns <[email protected]>
wrote:

> *This discussion is intended to be a branch of the thread: "[Analytics]
> Pageview API Status update".*
>
> Hi all,
>
> We in Analytics are trying to *choose a storage technology to keep the
> pageview data* for analysis.
>
> We don't want to design the final system that covers all our needs yet
> (there are still things to discuss), but rather build something *that
> implements the current stats.grok.se <http://stats.grok.se>
> functionality* as a first step. This way we can get a better grasp of
> what our difficulties and limitations will be regarding performance and
> privacy.
>
> The objective of this thread is to *choose 3 storage technologies*. We
> will later set up and fill each of them with 1 day of test data, evaluate
> them, and decide which one we will go for.
>
> There are 2 blocks of data to be stored:
>
>    1. *Cube that represents the number of pageviews broken down by the
>    following dimensions*:
>       - day/hour (size: 24)
>       - project (size: 800)
>       - agent type (size: 2)
>
> To test with an initial level of anonymity, all cube cells whose value is
> less than k=100 have an undefined value. However, to be able to retrieve
> aggregated values without losing those undefined counts, all combinations
> of slices and dices are precomputed before anonymization and belong to the
> cube, too. Like this:
>
> dim1,  dim2,  dim3,  ...,  dimN,  val
>    a,  null,  null,  ...,  null,     15  // pv for dim1=a
>    a,     x,  null,  ...,  null,     34  // pv for dim1=a & dim2=x
>    a,     x,     1,  ...,  null,     27  // pv for dim1=a & dim2=x & dim3=1
>    a,     x,     1,  ...,  true,  undef  // pv for dim1=a & dim2=x & ... & dimN=true
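> For concreteness, here is a minimal Python sketch of that
> precompute-then-suppress idea. The dimension values, counts, and k are toy
> examples invented for illustration, not real pageview data:

```python
from itertools import product

# Toy example: precompute every slice/dice rollup of a tiny cube
# before anonymization, then suppress cells below k.
K = 100

# Raw, fully-specified cells: (hour, project, agent_type) -> pageviews
raw = {
    (17, "en.wikipedia", "user"): 250,
    (17, "en.wikipedia", "spider"): 40,
    (18, "en.wikipedia", "user"): 90,
}

dims = 3
cube = {}
# Each raw cell contributes to every rollup in which some dimensions
# are replaced by None (the "null" marker in the table above).
for cell, count in raw.items():
    for mask in product([True, False], repeat=dims):
        key = tuple(v if keep else None for v, keep in zip(cell, mask))
        cube[key] = cube.get(key, 0) + count

# Anonymize: any cell whose value is below k becomes undefined.
anonymized = {key: (v if v >= K else None) for key, v in cube.items()}
```

> Because the rollups are summed before suppression, the grand total
> (None, None, None) stays exact even when individual small cells are
> hidden.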
>
> So the size of this dataset would be something between 100M and 200M
> records per year, I think.
>
>
>    2. *Timeseries dataset that stores the number of pageviews per article
>    over time with*:
>       - maximum resolution: hourly
>       - diminishing resolution over time is acceptable if there are
>       performance problems
>
> article (dialect.project/article),       day/hour,   value
>            en.wikipedia/Main_page,  2015-01-01 17,  123456
>             en.wiktionary/Bazinga,  2015-01-02 13,   23456
>
> It's difficult to calculate the size of that. How many articles do we
> have? 34M?
> But not all of them will have pageviews every hour...
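> A rough worst-case bound, taking the 34M figure above as a guess and
> picking an arbitrary sparsity factor purely to show the order of
> magnitude:

```python
# Back-of-envelope estimate of the timeseries dataset size, assuming
# ~34M articles (a guess) and at most one row per article per hour.
articles = 34_000_000
hours_per_year = 24 * 365  # 8760

# Worst case: every article gets at least one view every hour.
worst_case_rows = articles * hours_per_year

# Most articles get no views most hours. If, say, only 1% of
# article-hours have any pageviews (an arbitrary assumption), the
# yearly row count drops by the same factor.
sparse_rows = worst_case_rows // 100

print(worst_case_rows)  # hundreds of billions of rows/year, worst case
print(sparse_rows)
```

> Even with heavy sparsity this is orders of magnitude bigger than the
> cube, which is why diminishing resolution over time may matter here.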
>
>
>
> *Note*: we should expect the storage system to receive high-volume batch
> inserts every hour or so, and queries that are a lot more frequent but
> also a lot lighter in data size.
>
> And that is that.
> *So please, feel free to suggest storage technologies, comment, etc!*
> And if there is any assumption I made in which you do not agree, please
> comment also!
>
> I will start the thread with 2 suggestions:
> 1) *PostgreSQL*: Seems to be able to handle the volume of the data, and
> there are known ways to implement diminishing resolution for timeseries.
> 2) *Project Voldemort*: As we are denormalizing the cube entirely for
> anonymity, the db doesn't need to compute aggregations, so it may well be a
> key-value store.
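> To illustrate the key-value point: since every slice/dice cell is
> precomputed, a read is a single lookup with no server-side aggregation.
> This sketch uses a plain dict as a stand-in for any key-value store, and
> the key format is an invented example, not a proposed schema:

```python
store = {}  # stand-in for a key-value store such as Voldemort

def cell_key(day_hour, project, agent_type):
    # "null" marks a rolled-up dimension, matching the cube layout above.
    return "|".join(str(d) if d is not None else "null"
                    for d in (day_hour, project, agent_type))

# Hourly batch insert of precomputed cells
# (a value of None would represent an anonymized/undefined cell).
store[cell_key("2015-01-01 17", "en.wikipedia", None)] = 123456
store[cell_key("2015-01-01 17", None, None)] = 2000000

# A query is a plain get -- no aggregation at read time.
value = store.get(cell_key("2015-01-01 17", "en.wikipedia", None))
```

> The trade-off is that the full denormalization multiplies the number of
> stored cells, which is exactly what the 100M-200M/year estimate above
> accounts for.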
>
> Cheers!
>
> Marcel
>
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>