OK, so I think we have our candidates:

1) PostgreSQL
2) Cassandra

We can discuss this at our next tasking meeting. If anyone has more suggestions or comments, we still have a couple of days until then.
Thank you all!

Marcel

On Sat, Jun 13, 2015 at 11:37 AM, Joseph Allemandou <[email protected]> wrote:

> Andrew, Toby, that makes perfect sense.
> While thinking that the distributed aspect of Impala would handle high
> availability issues, I very much understand that having a front-end system
> relying on the analytics cluster is not as good as having a dedicated
> storage solution.
> Thanks for the good point :)
> Joseph
>
> On Fri, Jun 12, 2015 at 9:58 PM, Toby Negrin <[email protected]> wrote:
>
>> As someone who has run production serving systems on top of Hadoop, I
>> think this is risky. We've had substantial planned and unplanned downtime
>> on the cluster (which is to be expected) and it would be bad for a pageview
>> API to be impacted.
>>
>> -Toby
>>
>> On Fri, Jun 12, 2015 at 9:46 AM, Andrew Otto <[email protected]> wrote:
>>
>>> I think we could add Impala in storage technologies to assess.
>>>
>>> I think we don’t want to build the pageview API on top of the Analytics
>>> Cluster.
>>>
>>> On Jun 12, 2015, at 05:37, Joseph Allemandou <[email protected]> wrote:
>>>
>>> I think we could add Impala in storage technologies to assess.
>>> It allows reading / computing straight from HDFS and should be fast
>>> enough for not too bad UEx.
>>> Maybe ?
>>>
>>> On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns <[email protected]> wrote:
>>>
>>>> This thread seems to have paused for 1 or 2 days now.
>>>>
>>>> So summarizing, the following storage technologies have been mentioned:
>>>>
>>>> - PostgreSQL
>>>> - MySQL
>>>> - Cassandra
>>>> - Voldemort
>>>>
>>>> And the following concerns have been raised on using something that:
>>>>
>>>> - We're already familiar with
>>>> - Permits meta-analytics
>>>> - Is queryable for JSON/TSV with little user setup
>>>> - Withstands high-throughput bulk inserts
>>>> - Is queryable for slice and dice, even if we need to precompute
>>>>   those
>>>>
>>>> It seems that there aren't many candidates and that the discussion
>>>> focused on SQL vs NoSQL, so what about choosing 2 stores instead of 3, one
>>>> of each type, say PostgreSQL and Cassandra?
>>>>
>>>> Or, anyone with more thoughts or suggestions?
>>>>
>>>> On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns <[email protected]> wrote:
>>>>
>>>>> If we are going to completely denormalize the data sets for
>>>>> anonymization,
>>>>> and we expect just slice and dice queries to the database,
>>>>> I think we wouldn't take much advantage of a relational DB,
>>>>> because it wouldn't need to aggregate values, slice or dice,
>>>>> all slices and dices would be precomputed, right?
>>>>>
>>>>> It seems to me that the nature of this denormalized/anonymized data
>>>>> sets is more like a key-value store. That's why I suggested Voldemort at
>>>>> first (which, they say, has a slightly faster read than Cassandra), but I
>>>>> see the preference for Cassandra for it being a known tool inside WMF.
>>>>> So, +1 for Cassandra!
>>>>>
>>>>> However, if we foresee the need of adding more data sets to the same
>>>>> DB, or querying them in a different way, key-value store would be a
>>>>> limitation.
>>>>>
>>>>> On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu <[email protected]> wrote:
>>>>>
>>>>>> On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke <[email protected]> wrote:
>>>>>>
>>>>>>> On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <[email protected]> wrote:
>>>>>>>
>>>>>>>> Eric, I think we should allow arbitrary querying on any dimension
>>>>>>>> for that first data block. We could pre-aggregate all of those
>>>>>>>> combinations pretty easily since the dimensions have very low
>>>>>>>> cardinality.
>>>>>>>
>>>>>>> Are you thinking about something like
>>>>>>> /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more
>>>>>>> dimensions?
>>>>>>
>>>>>> only one more right now, called "agent_type". But this is just the
>>>>>> first "cube" and we're planning a geo cube with more dimensions and are
>>>>>> probably going to try and release data split up by access method (mobile,
>>>>>> desktop, etc.) and other dimensions as people need them. This will be
>>>>>> tricky as we try to protect privacy but that aside, the number of
>>>>>> dimensions per endpoint, right now, seems to hover around 4 or 5.
>>>>>>
>>>>>>>> For the article-level data, no, we'd want just basic timeseries
>>>>>>>> querying.
>>>>>>>>
>>>>>>>> Thanks Gabriel, if you could point us to an example of these
>>>>>>>> secondary RESTBase indices, that'd be interesting.
>>>>>>>
>>>>>>> The API used to define these tables is described in
>>>>>>> https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md,
>>>>>>> and the algorithm used to keep those indexes up to date is described in
>>>>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/SecondaryIndexes.md
>>>>>>> and largely implemented in
>>>>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/secondaryIndexes.js
>>>>>>> .
>>>>>>
>>>>>> very cool, thx.
>
> --
> *Joseph Allemandou*
> Data Engineer @ Wikimedia Foundation
> IRC: joal
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
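
Editorial sketch: the quoted thread suggests pre-aggregating every combination of the low-cardinality dimensions and serving the results from a key-value style store, addressed by something like /{project|all}/{agent|all}/{day}/{hour}. The Python below is only a minimal illustration of that idea, not anything the thread participants wrote. The sample rows, their values, and the names ROWS, precompute_cube and lookup are all hypothetical, and a plain dict stands in for whichever store (Cassandra, PostgreSQL, ...) ends up being chosen.

```python
from collections import defaultdict
from itertools import product

# Hypothetical raw rows: (project, agent, day, hour, views).
# The values are invented purely for illustration.
ROWS = [
    ("en.wikipedia", "user", "2015-06-09", 11, 120),
    ("en.wikipedia", "spider", "2015-06-09", 11, 30),
    ("de.wikipedia", "user", "2015-06-09", 11, 50),
]


def precompute_cube(rows):
    """Sum views for every (project|all, agent|all, day, hour) combination."""
    cube = defaultdict(int)
    for project, agent, day, hour, views in rows:
        # Each row counts toward its own cell and toward the "all" rollups.
        for p, a in product((project, "all"), (agent, "all")):
            cube[f"/{p}/{a}/{day}/{hour:02d}"] += views
    return dict(cube)


def lookup(cube, project="all", agent="all", *, day, hour):
    """A slice-and-dice query is a single key read; a missing key means 0."""
    return cube.get(f"/{project}/{agent}/{day}/{hour:02d}", 0)


cube = precompute_cube(ROWS)
total = lookup(cube, day="2015-06-09", hour=11)
english = lookup(cube, project="en.wikipedia", day="2015-06-09", hour=11)
```

Because every rollup is materialized at load time, the read path never aggregates, which is the property Marcel points to when he argues the data behaves like a key-value store. The cost is that each added dimension with an "all" rollup doubles the number of keys written per input row, which is relevant to the geo and access-method cubes mentioned in the thread.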
