Re: [Analytics] [Technical] Pick storage for pageview cubes

Marcel Ruiz Forns Tue, 16 Jun 2015 16:48:02 -0700

OK, so I think we have our candidates:
1) PostgreSQL
2) Cassandra

We can speak about this at our next tasking meeting.
If anyone has more suggestions or comments, we still have a couple of days
until then.


Thank you all!

Marcel


On Sat, Jun 13, 2015 at 11:37 AM, Joseph Allemandou <
[email protected]> wrote:

> Andrew, Toby, that makes perfect sense.
> While I assumed that the distributed nature of Impala would handle high
> availability issues, I very much understand that having a front-end system
> rely on the analytics cluster is not as good as having a dedicated
> storage solution.
> Thanks for the good point :)
> Joseph
>
>
> On Fri, Jun 12, 2015 at 9:58 PM, Toby Negrin <[email protected]>
> wrote:
>
>> As someone who has run production serving systems on top of Hadoop, I
>> think this is risky. We've had substantial planned and unplanned downtime
>> on the cluster (which is to be expected) and it would be bad for a pageview
>> API to be impacted.
>>
>> -Toby
>>
>> On Fri, Jun 12, 2015 at 9:46 AM, Andrew Otto <[email protected]> wrote:
>>
>>> I think we could add Impala in storage technologies to assess.
>>>
>>> I think we don’t want to build the pageview API on top of the Analytics
>>> Cluster.
>>>
>>>
>>>
>>>
>>> On Jun 12, 2015, at 05:37, Joseph Allemandou <[email protected]>
>>> wrote:
>>>
>>> I think we could add Impala to the storage technologies to assess.
>>> It allows reading / computing straight from HDFS and should be fast
>>> enough for an acceptable user experience.
>>> Maybe?
>>>
>>>
>>> On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns <
>>> [email protected]> wrote:
>>>
>>>> This thread seems to have paused for 1 or 2 days now.
>>>>
>>>> So summarizing, the following storage technologies have been mentioned:
>>>>
>>>>    - PostgreSQL
>>>>    - MySQL
>>>>    - Cassandra
>>>>    - Voldemort
>>>>
>>>> And the following concerns have been raised about choosing something that:
>>>>
>>>>    - We're already familiar with
>>>>    - Permits meta-analytics
>>>>    - Is queryable for json/tsv with little user setup
>>>>    - Withstands high-throughput bulk inserts
>>>>    - Is queryable for slice and dice, even if we need to precompute
>>>>    those
>>>>
>>>> It seems that there aren't many candidates and that the discussion
>>>> focused on SQL vs NoSQL, so what about choosing 2 stores instead of 3, one
>>>> of each type, say PostgreSQL and Cassandra?
>>>>
>>>> Or, anyone with more thoughts or suggestions?
>>>>
>>>>
>>>> On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns <
>>>> [email protected]> wrote:
>>>>
>>>>> If we are going to completely denormalize the data sets for
>>>>> anonymization, and we expect only slice and dice queries against the
>>>>> database, I think we wouldn't gain much from a relational DB: it
>>>>> wouldn't need to aggregate, slice or dice anything at query time,
>>>>> because all slices and dices would be precomputed, right?
>>>>>
>>>>> It seems to me that these denormalized/anonymized data sets fit a
>>>>> key-value store better. That's why I suggested Voldemort at first
>>>>> (which, they say, has slightly faster reads than Cassandra), but I
>>>>> see the preference for Cassandra, since it is a known tool inside WMF.
>>>>> So, +1 for Cassandra!
>>>>>
>>>>> However, if we foresee the need to add more data sets to the same
>>>>> DB, or to query them in a different way, a key-value store would be a
>>>>> limitation.
>>>>>
>>>>>
>>>>> On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Eric, I think we should allow arbitrary querying on any dimension
>>>>>>>> for that first data block.  We could pre-aggregate all of those
>>>>>>>> combinations pretty easily since the dimensions have very low 
>>>>>>>> cardinality.
>>>>>>>>
>>>>>>>
>>>>>>> Are you thinking about something like
>>>>>>> /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more
>>>>>>> dimensions?
>>>>>>>
>>>>>>
>>>>>> only one more right now, called "agent_type".  But this is just the
>>>>>> first "cube" and we're planning a geo cube with more dimensions and are
>>>>>> probably going to try and release data split up by access method (mobile,
>>>>>> desktop, etc.) and other dimensions as people need them.  This will be
>>>>>> tricky as we try to protect privacy but that aside, the number of
>>>>>> dimensions per endpoint, right now, seems to hover around 4 or 5.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>> For the article-level data, no, we'd want just basic timeseries
>>>>>>>> querying.
>>>>>>>>
>>>>>>>> Thanks Gabriel, if you could point us to an example of these
>>>>>>>> secondary RESTBase indices, that'd be interesting.
>>>>>>>>
>>>>>>>
>>>>>>> The API used to define these tables is described in
>>>>>>> https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md
>>>>>>> and the algorithm used to keep those indexes up to date is described in
>>>>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/SecondaryIndexes.md
>>>>>>> and largely implemented in
>>>>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/secondaryIndexes.js
>>>>>>>
>>>>>>
>>>>>> very cool, thx.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Analytics mailing list
>>>>>> [email protected]
>>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> *Joseph Allemandou*
>>> Data Engineer @ Wikimedia Foundation
>>> IRC: joal
>>>
>>>
>>
>>
>>
>
>
> --
> *Joseph Allemandou*
> Data Engineer @ Wikimedia Foundation
> IRC: joal
>
>
>
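
The pre-aggregation idea discussed above (every combination of the
low-cardinality dimensions is computed ahead of time, with "all" as a
rollup value, and served under paths shaped like
/{project|all}/{agent|all}/{day}/{hour}) could be sketched roughly as
below. This is only an illustration under stated assumptions: the sample
rows, the two-digit hour format and the names build_cube and lookup are
hypothetical, and a plain Python dict stands in for whatever key-value
store ends up being chosen.

```python
from collections import defaultdict
from itertools import product

# Hypothetical raw rows: (project, agent, day, hour, views).
ROWS = [
    ("en.wikipedia", "user", "2015-06-09", 17, 120),
    ("en.wikipedia", "spider", "2015-06-09", 17, 30),
    ("de.wikipedia", "user", "2015-06-09", 17, 50),
]


def build_cube(rows):
    """Precompute view counts for every {project|all}/{agent|all} combination.

    Each raw row contributes to four keys: its own (project, agent) pair
    plus the three rollups where project and/or agent is replaced by "all".
    """
    cube = defaultdict(int)
    for project, agent, day, hour, views in rows:
        for p, a in product((project, "all"), (agent, "all")):
            # Assumed key layout, modelled on the path pattern in the thread.
            cube[f"/{p}/{a}/{day}/{hour:02d}"] += views
    return dict(cube)


# The dict plays the role of the key-value store (an assumption).
CUBE = build_cube(ROWS)


def lookup(path):
    """Serve a slice with a single key read; nothing is aggregated here."""
    return CUBE.get(path, 0)


if __name__ == "__main__":
    print(lookup("/all/all/2015-06-09/17"))
    print(lookup("/en.wikipedia/all/2015-06-09/17"))
```

With this layout a read never aggregates anything, which is the argument
for a key-value store made earlier in the thread. The cost is that each
added dimension with an "all" rollup doubles the number of keys written
per raw row, and any query shape that was not precomputed cannot be
answered at all.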

