Andrew, Toby, that makes perfect sense.

I had assumed that Impala's distributed design would take care of high-availability issues, but I fully understand that having a front-end system rely on the analytics cluster is not as good as having a dedicated storage solution.

Thanks for the good point :)
Joseph

On Fri, Jun 12, 2015 at 9:58 PM, Toby Negrin <[email protected]> wrote:

> As someone who has run production serving systems on top of Hadoop, I
> think this is risky. We've had substantial planned and unplanned downtime
> on the cluster (which is to be expected) and it would be bad for a pageview
> API to be impacted.
>
> -Toby
>
> On Fri, Jun 12, 2015 at 9:46 AM, Andrew Otto <[email protected]> wrote:
>
>> I think we could add Impala in storage technologies to assess.
>>
>> I think we don’t want to build the pageview API on top of the Analytics
>> Cluster.
>>
>> On Jun 12, 2015, at 05:37, Joseph Allemandou <[email protected]>
>> wrote:
>>
>> I think we could add Impala in storage technologies to assess.
>> It allows reading / computing straight from HDFS and should be fast
>> enough for not too bad UEx.
>> Maybe ?
>>
>> On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns <[email protected]> wrote:
>>
>>> This thread seems to have paused for 1 or 2 days now.
>>>
>>> So summarizing, the following storage technologies have been mentioned:
>>>
>>> - PostgreSQL
>>> - MySQL
>>> - Cassandra
>>> - Voldemort
>>>
>>> And the following concerns have been raised on using something that:
>>>
>>> - We're already familiar with
>>> - Permits meta-analytics
>>> - Is queriable for json/tsv with little user setup
>>> - Withstands high throughput bulk inserts
>>> - Is queriable for slice and dice, even if we need to precompute
>>>   those
>>>
>>> It seems that there aren't many candidates and that the discussion
>>> focused on SQL vs NoSQL, so what about choosing 2 stores instead of 3, one
>>> of each type, say PostgreSQL and Cassandra?
>>>
>>> Or, anyone with more thoughts or suggestions?
>>>
>>> On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns <[email protected]> wrote:
>>>
>>>> If we are going to completely denormalize the data sets for
>>>> anonymization,
>>>> and we expect just slice and dice queries to the database,
>>>> I think we wouldn't take much advantage of a relational DB,
>>>> because it wouldn't need to aggregate values, slice or dice,
>>>> all slices and dices would be precomputed, right?
>>>>
>>>> It seems to me that the nature of this denormalized/anonymized data
>>>> sets is more like a key-value store. That's why I suggested Voldemort at
>>>> first (which, they say, has a slightly faster read than Cassandra), but I
>>>> see the preference for Cassandra for it being a known tool inside WMF.
>>>> So, +1 for Cassandra!
>>>>
>>>> However, if we foresee the need of adding more data sets to the same
>>>> DB, or querying them in a different way, key-value store would be a
>>>> limitation.
>>>>
>>>> On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu <[email protected]> wrote:
>>>>
>>>>> On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke <[email protected]> wrote:
>>>>>
>>>>>> On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <[email protected]> wrote:
>>>>>>
>>>>>>> Eric, I think we should allow arbitrary querying on any dimension
>>>>>>> for that first data block. We could pre-aggregate all of those
>>>>>>> combinations pretty easily since the dimensions have very low
>>>>>>> cardinality.
>>>>>>
>>>>>> Are you thinking about something like
>>>>>> /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more
>>>>>> dimensions?
>>>>>
>>>>> only one more right now, called "agent_type". But this is just the
>>>>> first "cube" and we're planning a geo cube with more dimensions and are
>>>>> probably going to try and release data split up by access method (mobile,
>>>>> desktop, etc.) and other dimensions as people need them. This will be
>>>>> tricky as we try to protect privacy but that aside, the number of
>>>>> dimensions per endpoint, right now, seems to hover around 4 or 5.
>>>>>
>>>>>>> For the article-level data, no, we'd want just basic timeseries
>>>>>>> querying.
>>>>>>>
>>>>>>> Thanks Gabriel, if you could point us to an example of these
>>>>>>> secondary RESTBase indices, that'd be interesting.
>>>>>>
>>>>>> The API used to define these tables is described in
>>>>>> https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md,
>>>>>> and the algorithm used to keep those indexes up to date is described in
>>>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/SecondaryIndexes.md
>>>>>> and largely implemented in
>>>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/secondaryIndexes.js
>>>>>> .
>>>>>
>>>>> very cool, thx.
>>>>>
>>>>> _______________________________________________
>>>>> Analytics mailing list
>>>>> [email protected]
>>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>> --
>> *Joseph Allemandou*
>> Data Engineer @ Wikimedia Foundation
>> IRC: joal

--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
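For reference, the pre-aggregation idea from the quoted thread (every combination of a few low-cardinality dimensions computed ahead of time and served from a key-value store under keys shaped like /{project|all}/{agent|all}/{day}/{hour}) could be sketched roughly as below. This is only an illustration under stated assumptions: the sample rows, the names `precompute` and `lookup`, and the in-memory dict standing in for the key-value store are all hypothetical and do not come from the thread or from any WMF code.

```python
from collections import defaultdict
from itertools import product

# Hypothetical hourly rows: (project, agent, day, hour, views).
# Invented sample data, used only to exercise the sketch.
ROWS = [
    ("en.wikipedia", "user", "2015-06-09", 17, 120),
    ("en.wikipedia", "spider", "2015-06-09", 17, 30),
    ("de.wikipedia", "user", "2015-06-09", 17, 50),
    ("en.wikipedia", "user", "2015-06-09", 18, 80),
]


def precompute(rows):
    """Build a key -> summed views map covering every slice, including 'all'."""
    cube = defaultdict(int)
    for project, agent, day, hour, views in rows:
        # Each dimension contributes its own value plus the "all" rollup,
        # so one input row feeds 2 x 2 = 4 keys.
        for p, a in product((project, "all"), (agent, "all")):
            cube[f"/{p}/{a}/{day}/{hour:02d}"] += views
    return dict(cube)


# A plain dict stands in for the key-value store (an assumption of this sketch).
CUBE = precompute(ROWS)


def lookup(project, agent, day, hour):
    """Answer a slice with a single key read; nothing is aggregated at query time."""
    return CUBE.get(f"/{project}/{agent}/{day}/{hour:02d}", 0)
```

With this layout a request such as `lookup("all", "user", "2015-06-09", 17)` is one key read. The trade-off is that the number of stored keys grows multiplicatively with each added dimension, which is why the approach suits the 4 or 5 low-cardinality dimensions mentioned in the thread better than article-level data.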
