Andrew, Toby, that makes perfect sense.
I had assumed that the distributed nature of Impala would take care of
high-availability issues, but I very much understand that having a front-end
system rely on the analytics cluster is not as good as having a dedicated
storage solution.
Thanks for the good point :)
Joseph

On Fri, Jun 12, 2015 at 9:58 PM, Toby Negrin <[email protected]> wrote:

> As someone who has run production serving systems on top of Hadoop, I
> think this is risky. We've had substantial planned and unplanned downtime
> on the cluster (which is to be expected) and it would be bad for a pageview
> API to be impacted.
>
> -Toby
>
> On Fri, Jun 12, 2015 at 9:46 AM, Andrew Otto <[email protected]> wrote:
>
>> I think we could add Impala in storage technologies to assess.
>>
>> I think we don’t want to build the pageview API on top of the Analytics
>> Cluster.
>>
>>
>>
>>
>> On Jun 12, 2015, at 05:37, Joseph Allemandou <[email protected]>
>> wrote:
>>
>> I think we could add Impala in storage technologies to assess.
>> It allows reading / computing straight from HDFS and should be fast
>> enough for a not-too-bad user experience.
>> Maybe?
>>
>>
>> On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns <[email protected]> wrote:
>>
>>> This thread seems to have paused for 1 or 2 days now.
>>>
>>> So summarizing, the following storage technologies have been mentioned:
>>>
>>>    - PostgreSQL
>>>    - MySQL
>>>    - Cassandra
>>>    - Voldemort
>>>
>>> And the following concerns have been raised about using something that:
>>>
>>>    - We're already familiar with
>>>    - Permits meta-analytics
>>>    - Is queryable for JSON/TSV with little user setup
>>>    - Withstands high-throughput bulk inserts
>>>    - Is queryable for slice and dice, even if we need to precompute
>>>    those
>>>
>>> It seems that there aren't many candidates and that the discussion
>>> focused on SQL vs NoSQL, so what about choosing 2 stores instead of 3, one
>>> of each type, say PostgreSQL and Cassandra?
>>>
>>> Or, anyone with more thoughts or suggestions?
>>>
>>>
>>> On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns <[email protected]> wrote:
>>>
>>>> If we are going to completely denormalize the data sets for
>>>> anonymization, and we expect just slice and dice queries to the
>>>> database, I think we wouldn't get much advantage from a relational DB:
>>>> it wouldn't need to aggregate values, slice or dice, because all
>>>> slices and dices would be precomputed, right?
>>>>
>>>> It seems to me that these denormalized/anonymized data sets are by
>>>> nature a better fit for a key-value store. That's why I suggested
>>>> Voldemort at first (which, they say, has slightly faster reads than
>>>> Cassandra), but I see the preference for Cassandra, since it is a known
>>>> tool inside WMF. So, +1 for Cassandra!
>>>>
>>>> However, if we foresee the need to add more data sets to the same
>>>> DB, or to query them in a different way, a key-value store would be a
>>>> limitation.
>>>>
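To illustrate the key-value trade-off described above, here is a minimal sketch. A plain Python dict stands in for the store (it is not the Cassandra or Voldemort API), and the key layout, function names and counts are all made up for the example. Precomputed slices are a single lookup, while anything that was not precomputed has to be assembled by the client.

```python
# Hypothetical key-value store: one entry per precomputed slice.
# A plain dict stands in for the real store; keys and counts are made up.
store = {
    ("en.wikipedia", "all", "2015-06-09", 16): 140,
    ("en.wikipedia", "all", "2015-06-09", 17): 150,
}


def get_slice(project, agent, day, hour):
    """Point lookup: only works for slices that were precomputed."""
    return store.get((project, agent, day, hour))


def get_day_total(project, agent, day):
    """Daily totals were not precomputed here, so the client issues
    24 point lookups and sums whatever it finds."""
    hourly = (get_slice(project, agent, day, hour) for hour in range(24))
    return sum(value for value in hourly if value is not None)
```

In this sketch, a daily total costs 24 reads instead of one, which is the kind of limitation that appears once queries go beyond the precomputed keys.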
>>>>
>>>> On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu <
>>>> [email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Eric, I think we should allow arbitrary querying on any dimension
>>>>>>> for that first data block.  We could pre-aggregate all of those
>>>>>>> combinations pretty easily since the dimensions have very low 
>>>>>>> cardinality.
>>>>>>>
>>>>>>
>>>>>> Are you thinking about something like
>>>>>> /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more
>>>>>> dimensions?
>>>>>>
>>>>>
>>>>> only one more right now, called "agent_type".  But this is just the
>>>>> first "cube" and we're planning a geo cube with more dimensions and are
>>>>> probably going to try and release data split up by access method (mobile,
>>>>> desktop, etc.) and other dimensions as people need them.  This will be
>>>>> tricky as we try to protect privacy but that aside, the number of
>>>>> dimensions per endpoint, right now, seems to hover around 4 or 5.
>>>>>
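As a rough sketch of the pre-aggregation discussed above, assuming the /{project|all}/{agent|all}/{day}/{hour} layout from Gabriel's question: each raw row contributes to every combination in which a dimension is either its own value or "all". The row format, the function name and the sample numbers are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical raw rows: (project, agent, day, hour, views). Values are made up.
ROWS = [
    ("en.wikipedia", "user", "2015-06-09", 17, 120),
    ("en.wikipedia", "spider", "2015-06-09", 17, 30),
    ("de.wikipedia", "user", "2015-06-09", 17, 50),
]


def preaggregate(rows):
    """Sum views for every {project|all}/{agent|all}/{day}/{hour} key."""
    cube = Counter()
    for project, agent, day, hour, views in rows:
        # Each row feeds 2 x 2 = 4 keys: its own value or "all" per dimension.
        for p, a in product((project, "all"), (agent, "all")):
            cube[(p, a, day, hour)] += views
    return cube


cube = preaggregate(ROWS)
print(cube[("all", "all", "2015-06-09", 17)])           # 200
print(cube[("en.wikipedia", "all", "2015-06-09", 17)])  # 150
```

With d rolled-up dimensions, each row is written 2^d times, so this stays cheap only while the number of dimensions per endpoint remains small, as with the 4 or 5 mentioned above.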
>>>>>
>>>>>>
>>>>>>
>>>>>>> For the article-level data, no, we'd want just basic timeseries
>>>>>>> querying.
>>>>>>>
>>>>>>> Thanks Gabriel, if you could point us to an example of these
>>>>>>> secondary RESTBase indices, that'd be interesting.
>>>>>>>
>>>>>>
>>>>>>  The API used to define these tables is described in
>>>>>> https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md,
>>>>>> and the algorithm used to keep those indexes up to date is described in
>>>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/SecondaryIndexes.md
>>>>>> and largely implemented in
>>>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/secondaryIndexes.js
>>>>>> .
>>>>>>
>>>>>
>>>>> very cool, thx.
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> *Joseph Allemandou*
>> Data Engineer @ Wikimedia Foundation
>> IRC: joal
>>
>>
>
>
>


-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal