I think we could add Impala to the list of storage technologies to assess. It can read and compute directly on data in HDFS, and should be fast enough for a decent user experience. Maybe?
On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns <[email protected]> wrote:

> This thread seems to have paused for 1 or 2 days now.
>
> So summarizing, the following storage technologies have been mentioned:
>
> - PostgreSQL
> - MySQL
> - Cassandra
> - Voldemort
>
> And the following concerns have been raised on using something that:
>
> - We're already familiar with
> - Permits meta-analytics
> - Is queriable for json/tsv with little user setup
> - Withstands high throughput bulk inserts
> - Is queriable for slice and dice, even if we need to precompute those
>
> It seems that there aren't many candidates and that the discussion focused
> on SQL vs NoSQL, so what about choosing 2 stores instead of 3, one of each
> type, say PostgreSQL and Cassandra?
>
> Or, anyone with more thoughts or suggestions?
>
>
> On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns <[email protected]>
> wrote:
>
>> If we are going to completely denormalize the data sets for anonymization,
>> and we expect just slice and dice queries to the database,
>> I think we wouldn't take much advantage of a relational DB,
>> because it wouldn't need to aggregate values, slice or dice,
>> all slices and dices would be precomputed, right?
>>
>> It seems to me that the nature of this denormalized/anonymized data sets
>> is more like a key-value store. That's why I suggested Voldemort at first
>> (which, they say, has a slightly faster read than Cassandra), but I see the
>> preference for Cassandra for it being a known tool inside WMF.
>> So, +1 for Cassandra!
>>
>> However, if we foresee the need of adding more data sets to the same DB,
>> or querying them in a different way, key-value store would be a limitation.
>>
>>
>> On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu <[email protected]>
>> wrote:
>>
>>> On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke <[email protected]>
>>> wrote:
>>>
>>>> On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <[email protected]>
>>>> wrote:
>>>>
>>>>> Eric, I think we should allow arbitrary querying on any dimension for
>>>>> that first data block. We could pre-aggregate all of those combinations
>>>>> pretty easily since the dimensions have very low cardinality.
>>>>>
>>>>
>>>> Are you thinking about something like
>>>> /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more
>>>> dimensions?
>>>>
>>>
>>> only one more right now, called "agent_type". But this is just the
>>> first "cube" and we're planning a geo cube with more dimensions and are
>>> probably going to try and release data split up by access method (mobile,
>>> desktop, etc.) and other dimensions as people need them. This will be
>>> tricky as we try to protect privacy but that aside, the number of
>>> dimensions per endpoint, right now, seems to hover around 4 or 5.
>>>
>>>>
>>>>> For the article-level data, no, we'd want just basic timeseries
>>>>> querying.
>>>>>
>>>>> Thanks Gabriel, if you could point us to an example of these secondary
>>>>> RESTBase indices, that'd be interesting.
>>>>>
>>>>
>>>> The API used to define these tables is described in
>>>> https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md,
>>>> and the algorithm used to keep those indexes up to date is described in
>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/SecondaryIndexes.md
>>>> and largely implemented in
>>>> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/secondaryIndexes.js
>>>> .
>>>>
>>>
>>> very cool, thx.
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>

--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
