(+ Eric)

On Mon, Jun 8, 2015 at 5:42 PM, Toby Negrin <[email protected]> wrote:

> As always, I'd recommend that we go with tech we are familiar with --
> mysql or cassandra. We have a cassandra committer on staff who would be
> able to answer these questions in detail.
>
> -Toby
>
> On Mon, Jun 8, 2015 at 4:46 PM, Marcel Ruiz Forns <[email protected]>
> wrote:
>
>> *This discussion is intended to be a branch of the thread: "[Analytics]
>> Pageview API Status update".*
>>
>> Hi all,
>>
>> We in Analytics are trying to *choose a storage technology to keep the
>> pageview data* for analysis.
>>
>> We don't want to settle on a final system that covers all our needs yet
>> (there are still things to discuss), but rather have something *that
>> implements the current stats.grok.se <http://stats.grok.se>
>> functionalities* as a first step. This way we can get a better grasp of
>> what our difficulties and limitations will be regarding performance and
>> privacy.
>>
>> The objective of this thread is to *choose 3 storage technologies*. We
>> will later set up and fill each of them with 1 day of test data, evaluate
>> them, and decide which one we will go for.
>>
>> There are 2 blocks of data to be stored:
>>
>>    1. *Cube that represents the number of pageviews broken down by the
>>    following dimensions*:
>>       - day/hour (size: 24)
>>       - project (size: 800)
>>       - agent type (size: 2)
>>
>> To test with an initial level of anonymity, all cube cells whose value is
>> less than k=100 have an undefined value. However, to be able to retrieve
>> aggregated values without losing those undefined counts, all combinations
>> of slices and dices are precomputed before anonymization and belong to the
>> cube, too. Like this:
>>
>> dim1,  dim2,  dim3,  ...,  dimN,  val
>>    a,  null,  null,  ...,  null,   15    // pv for dim1=a
>>    a,     x,  null,  ...,  null,   34    // pv for dim1=a & dim2=x
>>    a,     x,     1,  ...,  null,   27    // pv for dim1=a & dim2=x & dim3=1
>>    a,     x,     1,  ...,  true,  undef  // pv for dim1=a & dim2=x & ... & dimN=true
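>> For illustration, here is a minimal Python sketch of that precomputation;
>> the record layout and dimension names are made up, and a real
>> implementation would aggregate in whatever engine we pick:

```python
from itertools import combinations
from collections import Counter

K = 100  # anonymity threshold: cells with fewer pageviews become undefined

def precompute_cube(records, dimensions):
    """Aggregate pageview records over every slice/dice combination.

    records    -- iterable of dicts, one per pageview; the field names used
                  by the caller are illustrative, not a real schema
    dimensions -- ordered list of dimension names
    Returns a dict mapping a tuple of dimension values (None = aggregated
    over that dimension) to a count, with counts < K replaced by None.
    """
    counts = Counter()
    for rec in records:
        # Every subset of the dimensions gets its own aggregate row,
        # mirroring the precomputed slices/dices in the cube above.
        for r in range(len(dimensions) + 1):
            for subset in combinations(dimensions, r):
                key = tuple(rec[d] if d in subset else None for d in dimensions)
                counts[key] += 1
    # Suppress small cells *after* aggregating, so higher-level totals
    # still include the suppressed counts.
    return {key: (n if n >= K else None) for key, n in counts.items()}
```

>> Note the order of operations: suppression happens last, which is exactly
>> why the aggregates can be retrieved without losing the undefined counts.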
>>
>> So the size of this dataset would be something between 100M and 200M
>> records per year, I think.
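>> A quick back-of-envelope check of that figure, under the assumption that
>> each dimension contributes (size + 1) distinct values -- its real values
>> plus the null "aggregated over this dimension" value:

```python
# Row count per day = product of (size + 1) over all dimensions; only the
# three dimensions listed above are known, the rest is guesswork.
from math import prod

listed = {"day/hour": 24, "project": 800, "agent type": 2}
rows_per_day = prod(size + 1 for size in listed.values())   # 25 * 801 * 3
rows_per_year = rows_per_day * 365

print(rows_per_day)   # 60075 rows/day from the three listed dimensions alone
print(rows_per_year)  # 21927375, i.e. ~22M/year; a few more small dimensions
                      # would multiply this into the 100M-200M range
```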
>>
>>
>>    2. *Timeseries dataset that stores the number of pageviews per
>>    article over time, with*:
>>       - maximum resolution: hourly
>>       - diminishing resolution over time is accepted if there are
>>       performance problems
>>
>> article (dialect.project/article),       day/hour,   value
>>            en.wikipedia/Main_page,  2015-01-01 17,  123456
>>             en.wiktionary/Bazinga,  2015-01-02 13,   23456
>>
>> It's difficult to estimate the size of that dataset. How many articles do
>> we have? 34M?
>> But not all of them will have pageviews every hour...
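>> A sketch of what the diminishing-resolution idea could look like, assuming
>> hourly counts keyed by (article, timestamp) as in the table above; the
>> function name and key format are illustrative only:

```python
from collections import defaultdict

def roll_up_to_daily(hourly):
    """Reduce hourly pageview counts to daily totals (diminishing resolution).

    hourly -- dict mapping (article, "YYYY-MM-DD HH") to a pageview count,
              mirroring the timeseries table in this thread.
    Returns a dict mapping (article, "YYYY-MM-DD") to the summed count.
    """
    daily = defaultdict(int)
    for (article, hour_ts), count in hourly.items():
        day = hour_ts.split(" ")[0]  # drop the hour component
        daily[(article, day)] += count
    return dict(daily)

# Example with the rows from the thread (plus one extra hour):
hourly = {
    ("en.wikipedia/Main_page", "2015-01-01 17"): 123456,
    ("en.wikipedia/Main_page", "2015-01-01 18"): 1000,
    ("en.wiktionary/Bazinga",  "2015-01-02 13"): 23456,
}
daily = roll_up_to_daily(hourly)  # Main_page collapses into one daily row
```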
>>
>>
>>
>> *Note*: I guess we should consider that the storage system will
>> presumably have high-volume batch inserts every hour or so, and queries
>> that will be a lot more frequent but also a lot lighter in data size.
>>
>> And that is that.
>> *So please, feel free to suggest storage technologies, comment, etc!*
>> And if there is any assumption I made in which you do not agree, please
>> comment also!
>>
>> I will start the thread with 2 suggestions:
>> 1) *PostgreSQL*: Seems to be able to handle the volume of the data, and
>> we know how to implement diminishing resolution for timeseries on it.
>> 2) *Project Voldemort*: As we are denormalizing the cube entirely for
>> anonymity, the db doesn't need to compute aggregations, so it may well be a
>> key-value store.
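>> To illustrate that: since every aggregate is precomputed, each cube cell
>> maps to exactly one key-value pair. A sketch with an arbitrary key
>> encoding (the "|" separator and "-" placeholder are made-up choices, not
>> tied to Voldemort or any other store):

```python
def cell_key(values):
    """Encode one precomputed cube cell (a tuple of dimension values) as a
    flat string key; None means "aggregated over this dimension"."""
    return "|".join("-" if v is None else str(v) for v in values)

def cell_value(count):
    """Encode the cell value; k-suppressed cells are stored as 'undef'."""
    return "undef" if count is None else str(count)

# Example: the cube row (a, x, null, 27) becomes a single put() into any
# key-value store:  store.put("a|x|-", "27")
```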
>>
>> Cheers!
>>
>> Marcel
>>
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
>
>


-- 
Gabriel Wicke
Principal Engineer, Wikimedia Foundation