I think we could add Impala to the list of storage technologies to assess.
It allows reading and computing straight from HDFS and should be fast enough
for a decent user experience.
Maybe?


On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns <[email protected]>
wrote:

> This thread seems to have paused for 1 or 2 days now.
>
> So summarizing, the following storage technologies have been mentioned:
>
>    - PostgreSQL
>    - MySQL
>    - Cassandra
>    - Voldemort
>
> And the following concerns have been raised on using something that:
>
>    - We're already familiar with
>    - Permits meta-analytics
>    - Is queryable for JSON/TSV with little user setup
>    - Withstands high throughput bulk inserts
>    - Is queryable for slice and dice, even if we need to precompute those
>
> It seems that there aren't many candidates and that the discussion focused
> on SQL vs NoSQL, so what about choosing 2 stores instead of 3, one of each
> type, say PostgreSQL and Cassandra?
>
> Or, anyone with more thoughts or suggestions?
>
>
> On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns <[email protected]>
> wrote:
>
>> If we are going to completely denormalize the data sets for anonymization,
>> and we expect just slice and dice queries to the database,
>> I think we wouldn't take much advantage of a relational DB,
>> because it wouldn't need to aggregate values, slice, or dice:
>> all slices and dices would be precomputed, right?
>>
>> It seems to me that the nature of these denormalized/anonymized data sets
>> is more like a key-value store. That's why I suggested Voldemort at first
>> (which, they say, has slightly faster reads than Cassandra), but I see the
>> preference for Cassandra, since it is a known tool inside WMF.
>> So, +1 for Cassandra!
>>
>> However, if we foresee the need to add more data sets to the same DB,
>> or to query them in a different way, a key-value store would be a limitation.
>>
>>
>> On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke <[email protected]>
>>> wrote:
>>>
>>>> On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <
>>>> [email protected]> wrote:
>>>>
>>>>> Eric, I think we should allow arbitrary querying on any dimension for
>>>>> that first data block.  We could pre-aggregate all of those combinations
>>>>> pretty easily since the dimensions have very low cardinality.
>>>>>
>>>>
>>>> Are you thinking about something like
>>>> /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more
>>>> dimensions?
>>>>
>>>
>>> only one more right now, called "agent_type".  But this is just the
>>> first "cube" and we're planning a geo cube with more dimensions and are
>>> probably going to try and release data split up by access method (mobile,
>>> desktop, etc.) and other dimensions as people need them.  This will be
>>> tricky as we try to protect privacy but that aside, the number of
>>> dimensions per endpoint, right now, seems to hover around 4 or 5.
>>>
>>>
>>>>
>>>>
>>>>> For the article-level data, no, we'd want just basic timeseries
>>>>> querying.
>>>>>
>>>>> Thanks Gabriel, if you could point us to an example of these secondary
>>>>> RESTBase indices, that'd be interesting.
>>>>>
>>>>
>>>>  The API used to define these tables is described in
>>>> https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md,
>>>> and the algorithm used to keep those indexes up to date is described in
>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/SecondaryIndexes.md
>>>> and largely implemented in
>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/secondaryIndexes.js
>>>> .
>>>>
>>>
>>> very cool, thx.
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>
>
>
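
To make the pre-aggregation idea quoted above more concrete, here is a minimal
sketch of rolling up a low-cardinality cube into key-value entries shaped like
/{project|all}/{agent|all}/{day}/{hour}. The field names, the "views" metric
and the sample rows are assumptions made up for illustration; they are not
taken from any actual schema, and the plain dict only stands in for whichever
store ends up being chosen.

```python
from collections import defaultdict
from itertools import product

ALL = "all"  # rollup marker, as in /{project|all}/{agent|all}/...


def preaggregate(rows):
    """Sum views for every (project|all, agent|all) combination per hour."""
    totals = defaultdict(int)
    for row in rows:
        # Each row contributes to its concrete value and to the "all" rollup
        # of both dimensions, i.e. to 2 x 2 = 4 keys.
        combos = product((row["project"], ALL), (row["agent"], ALL))
        for project, agent in combos:
            key = f"/{project}/{agent}/{row['day']}/{row['hour']}"
            totals[key] += row["views"]
    return dict(totals)


# Hypothetical sample data, not real traffic numbers.
ROWS = [
    {"project": "en.wikipedia", "agent": "user", "day": "2015-06-09",
     "hour": "17", "views": 120},
    {"project": "en.wikipedia", "agent": "spider", "day": "2015-06-09",
     "hour": "17", "views": 30},
    {"project": "de.wikipedia", "agent": "user", "day": "2015-06-09",
     "hour": "17", "views": 50},
]

store = preaggregate(ROWS)
print(store["/en.wikipedia/all/2015-06-09/17"])  # 150
```

Every slice then becomes a single key lookup, which is why a key-value store
fits. The cost is that the number of keys grows with the product of the
dimension cardinalities, so this only stays cheap while there are few
dimensions with few values each.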


-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
