> I think we could add Impala in storage technologies to assess.

I think we don’t want to build the pageview API on top of the Analytics Cluster.
> On Jun 12, 2015, at 05:37, Joseph Allemandou <[email protected]> wrote:
>
> I think we could add Impala in storage technologies to assess.
> It allows reading / computing straight from HDFS and should be fast enough
> for not too bad UEx.
> Maybe ?
>
>
> On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns <[email protected]> wrote:
> This thread seems to have paused for 1 or 2 days now.
>
> So summarizing, the following storage technologies have been mentioned:
> - PostgreSQL
> - MySQL
> - Cassandra
> - Voldemort
> And the following concerns have been raised on using something that:
> - We're already familiar with
> - Permits meta-analytics
> - Is queryable for json/tsv with little user setup
> - Withstands high throughput bulk inserts
> - Is queryable for slice and dice, even if we need to precompute those
> It seems that there aren't many candidates and that the discussion focused on
> SQL vs NoSQL, so what about choosing 2 stores instead of 3, one of each type,
> say PostgreSQL and Cassandra?
>
> Or, anyone with more thoughts or suggestions?
>
>
> On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns <[email protected]> wrote:
> If we are going to completely denormalize the data sets for anonymization,
> and we expect just slice and dice queries to the database,
> I think we wouldn't take much advantage of a relational DB,
> because it wouldn't need to aggregate values, slice or dice,
> all slices and dices would be precomputed, right?
>
> It seems to me that the nature of these denormalized/anonymized data sets is
> more like a key-value store. That's why I suggested Voldemort at first
> (which, they say, has a slightly faster read than Cassandra), but I see the
> preference for Cassandra for it being a known tool inside WMF.
> So, +1 for Cassandra!
>
> However, if we foresee the need of adding more data sets to the same DB, or
> querying them in a different way, key-value store would be a limitation.
>
>
> On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu <[email protected]> wrote:
>
>
> On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke <[email protected]> wrote:
> On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <[email protected]> wrote:
> Eric, I think we should allow arbitrary querying on any dimension for that
> first data block. We could pre-aggregate all of those combinations pretty
> easily since the dimensions have very low cardinality.
>
> Are you thinking about something like
> /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more
> dimensions?
>
> only one more right now, called "agent_type". But this is just the first
> "cube" and we're planning a geo cube with more dimensions and are probably
> going to try and release data split up by access method (mobile, desktop,
> etc.) and other dimensions as people need them. This will be tricky as we
> try to protect privacy but that aside, the number of dimensions per endpoint,
> right now, seems to hover around 4 or 5.
>
>
> For the article-level data, no, we'd want just basic timeseries querying.
>
> Thanks Gabriel, if you could point us to an example of these secondary
> RESTBase indices, that'd be interesting.
>
> The API used to define these tables is described in
> https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md,
> and the algorithm used to keep those indexes up to date is described in
> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/SecondaryIndexes.md
> and largely implemented in
> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/secondaryIndexes.js.
>
> very cool, thx.
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> --
> Joseph Allemandou
> Data Engineer @ Wikimedia Foundation
> IRC: joal
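
To make the pre-aggregation idea from the quoted thread concrete, here is a minimal sketch of how low-cardinality dimensions could be rolled up into key-value entries shaped like /{project|all}/{agent|all}/{day}/{hour}. The sample rows, the view counts, and the function names `precompute_cube` and `lookup` are assumptions made for illustration only; nothing here reflects an actual schema or implementation discussed on the list.

```python
from collections import defaultdict
from itertools import product

# Hypothetical raw rows: (project, agent, day, hour, views).
# All values are invented for illustration.
ROWS = [
    ("en.wikipedia", "user", "2015-06-09", 0, 120),
    ("en.wikipedia", "spider", "2015-06-09", 0, 30),
    ("de.wikipedia", "user", "2015-06-09", 0, 50),
    ("en.wikipedia", "user", "2015-06-09", 1, 80),
]


def precompute_cube(rows):
    """Sum views for every project/agent combination, using 'all' as the rollup value."""
    cube = defaultdict(int)
    for project, agent, day, hour, views in rows:
        # Each row contributes to four keys: the exact pair plus the
        # three rollups obtained by replacing a dimension with 'all'.
        for p, a in product((project, "all"), (agent, "all")):
            cube[f"/{p}/{a}/{day}/{hour}"] += views
    return dict(cube)


def lookup(cube, project, agent, day, hour):
    """Answer a slice-and-dice query with a single key lookup."""
    return cube.get(f"/{project}/{agent}/{day}/{hour}", 0)


cube = precompute_cube(ROWS)
print(lookup(cube, "all", "all", "2015-06-09", 0))
print(lookup(cube, "en.wikipedia", "all", "2015-06-09", 0))
```

With every slice precomputed this way, each query becomes a plain key lookup, which is why a key-value store is a natural fit for this kind of data. The number of keys per row doubles with every additional rolled-up dimension, so the approach stays cheap only while the dimension count remains small, as with the 4 or 5 dimensions per endpoint mentioned in the thread.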
