Re: [Analytics] [Technical] Pick storage for pageview cubes

Toby Negrin Fri, 12 Jun 2015 12:59:39 -0700

As someone who has run production serving systems on top of Hadoop, I think
this is risky. We've had substantial planned and unplanned downtime on the
cluster (which is to be expected) and it would be bad for a pageview API to
be impacted.


-Toby

On Fri, Jun 12, 2015 at 9:46 AM, Andrew Otto <[email protected]> wrote:

>> I think we could add Impala in storage technologies to assess.
>
> I think we don’t want to build the pageview API on top of the Analytics
> Cluster.
>
>
>
>
> On Jun 12, 2015, at 05:37, Joseph Allemandou <[email protected]>
> wrote:
>
> I think we could add Impala in storage technologies to assess.
> It allows reading / computing straight from HDFS and should be fast enough
> for a not-too-bad user experience.
> Maybe?
>
>
> On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns <[email protected]>
> wrote:
>
>> This thread seems to have paused for 1 or 2 days now.
>>
>> So summarizing, the following storage technologies have been mentioned:
>>
>>    - PostgreSQL
>>    - MySQL
>>    - Cassandra
>>    - Voldemort
>>
>> And the following concerns have been raised, in favor of using something that:
>>
>>    - We're already familiar with
>>    - Permits meta-analytics
>>    - Is queryable for JSON/TSV with little user setup
>>    - Withstands high-throughput bulk inserts
>>    - Is queryable for slice and dice, even if we need to precompute those
>>
>> It seems that there aren't many candidates and that the discussion
>> focused on SQL vs NoSQL, so what about choosing 2 stores instead of 3, one
>> of each type, say PostgreSQL and Cassandra?
>>
>> Or, anyone with more thoughts or suggestions?
>>
>>
>> On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns <[email protected]>
>> wrote:
>>
>>> If we are going to completely denormalize the data sets for
>>> anonymization,
>>> and we expect just slice and dice queries to the database,
>>> I think we wouldn't take much advantage of a relational DB,
>>> because it wouldn't need to aggregate values, slice or dice,
>>> all slices and dices would be precomputed, right?
>>>
>>> It seems to me that the nature of these denormalized/anonymized data sets
>>> is more like a key-value store. That's why I suggested Voldemort at first
>>> (which, they say, has a slightly faster read than Cassandra), but I see the
>>> preference for Cassandra for it being a known tool inside WMF.
>>> So, +1 for Cassandra!
>>>
>>> However, if we foresee the need to add more data sets to the same DB,
>>> or to query them in a different way, a key-value store would be a limitation.
>>>
>>>
>>> On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu <[email protected]>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke <[email protected]>
>>>> wrote:
>>>>
>>>>> On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Eric, I think we should allow arbitrary querying on any dimension for
>>>>>> that first data block.  We could pre-aggregate all of those combinations
>>>>>> pretty easily since the dimensions have very low cardinality.
>>>>>>
>>>>>
>>>>> Are you thinking about something like
>>>>> /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more
>>>>> dimensions?
>>>>>
>>>>
>>>> only one more right now, called "agent_type".  But this is just the
>>>> first "cube" and we're planning a geo cube with more dimensions and are
>>>> probably going to try and release data split up by access method (mobile,
>>>> desktop, etc.) and other dimensions as people need them.  This will be
>>>> tricky as we try to protect privacy but that aside, the number of
>>>> dimensions per endpoint, right now, seems to hover around 4 or 5.
>>>>
>>>>
>>>>>
>>>>>
>>>>>> For the article-level data, no, we'd want just basic timeseries
>>>>>> querying.
>>>>>>
>>>>>> Thanks Gabriel, if you could point us to an example of these
>>>>>> secondary RESTBase indices, that'd be interesting.
>>>>>>
>>>>>
>>>>>  The API used to define these tables is described in
>>>>> https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md,
>>>>> and the algorithm used to keep those indexes up to date is described in
>>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/SecondaryIndexes.md
>>>>> and largely implemented in
>>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/secondaryIndexes.js
>>>>> .
>>>>>
>>>>
>>>> very cool, thx.
>>>>
>>>> _______________________________________________
>>>> Analytics mailing list
>>>> [email protected]
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>>>
>>>
>>
>
>
> --
> *Joseph Allemandou*
> Data Engineer @ Wikimedia Foundation
> IRC: joal

_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
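
The pre-aggregation discussed in the quoted thread above (every combination of
a few low-cardinality dimensions, addressed by a path such as
/{project|all}/{agent|all}/{day}/{hour}) could be sketched roughly as follows.
This is only an illustration under stated assumptions, not code from the
thread: the function name build_cube, the row field names, and the plain
dictionary standing in for the key-value store are all hypothetical.

```python
from collections import defaultdict
from itertools import product

ALL = "all"  # rollup marker, as in the /{project|all}/{agent|all}/... path


def build_cube(rows, dimensions):
    """Pre-aggregate view counts for every combination of dimension values.

    Each row is a dict holding the given dimensions plus "day", "hour" and
    "views" (hypothetical field names). Assumes no real dimension value is
    literally "all", otherwise that row would be counted twice.
    """
    cube = defaultdict(int)
    for row in rows:
        # Each dimension contributes its own value and the "all" rollup.
        choices = [(row[d], ALL) for d in dimensions]
        for combo in product(*choices):
            key = "/" + "/".join(combo + (row["day"], row["hour"]))
            cube[key] += row["views"]
    return dict(cube)


rows = [
    {"project": "en.wikipedia", "agent": "user",
     "day": "2015-06-09", "hour": "17", "views": 100},
    {"project": "en.wikipedia", "agent": "spider",
     "day": "2015-06-09", "hour": "17", "views": 20},
    {"project": "de.wikipedia", "agent": "user",
     "day": "2015-06-09", "hour": "17", "views": 30},
]
cube = build_cube(rows, ["project", "agent"])

# Serving a slice is then a single key lookup, with no aggregation at read time.
print(cube["/en.wikipedia/all/2015-06-09/17"])
```

With n dimensions each row is written under 2^n keys, so this stays cheap only
while the number of dimensions is small, which matches the 4 or 5 dimensions
per endpoint mentioned in the thread.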
