Re: [Analytics] [Technical] Pick storage for pageview cubes

Andrew Otto Fri, 12 Jun 2015 09:48:40 -0700

> I think we could add Impala in storage technologies to assess.
I think we don’t want to build the pageview API on top of the Analytics Cluster.

> On Jun 12, 2015, at 05:37, Joseph Allemandou <[email protected]> wrote:
> 
> I think we could add Impala in storage technologies to assess.
> It allows reading and computing straight from HDFS, and should be fast
> enough to give a reasonably good user experience.
> Maybe?
> 
> 
> On Thu, Jun 11, 2015 at 11:11 PM, Marcel Ruiz Forns <[email protected]> wrote:
> This thread seems to have paused for 1 or 2 days now.
> 
> So summarizing, the following storage technologies have been mentioned:
> * PostgreSQL
> * MySQL
> * Cassandra
> * Voldemort
> And the following concerns have been raised about using something that:
> * We're already familiar with
> * Permits meta-analytics
> * Is queryable for JSON/TSV with little user setup
> * Withstands high-throughput bulk inserts
> * Is queryable for slice and dice, even if we need to precompute those
> It seems that there aren't many candidates and that the discussion focused on 
> SQL vs NoSQL, so what about choosing 2 stores instead of 3, one of each type, 
> say PostgreSQL and Cassandra?
> 
> Or, anyone with more thoughts or suggestions?
> 
> 
> On Wed, Jun 10, 2015 at 1:24 PM, Marcel Ruiz Forns <[email protected]> wrote:
> If we are going to completely denormalize the data sets for anonymization,
> and we expect just slice and dice queries to the database,
> I think we wouldn't get much benefit from a relational DB:
> it wouldn't need to aggregate, slice or dice anything,
> because all slices and dices would be precomputed, right?
> 
> It seems to me that these denormalized/anonymized data sets are, by nature, 
> more like a key-value store. That's why I suggested Voldemort at first 
> (which, they say, has slightly faster reads than Cassandra), but I see the 
> preference for Cassandra because it is a known tool inside WMF.
> So, +1 for Cassandra!
> 
> However, if we foresee the need to add more data sets to the same DB, or to 
> query them in a different way, a key-value store would be a limitation.
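
A minimal sketch of the key-value argument above, for illustration only. A plain dict stands in for the store (Cassandra, Voldemort, or anything else), and the key layout, the dimension values and the counts are all hypothetical. Every precomputed slice is a single get, and any slice that was not precomputed simply is not there, which is the limitation mentioned above.

```python
# Hypothetical sketch: a fully precomputed cube served from a key-value store.
# A plain dict stands in for the real store; no aggregation happens at read time.

def make_key(project, agent, day):
    """Build a composite key; 'all' marks a dimension that has been rolled up."""
    return f"{project}/{agent}/{day}"

# Rows that a batch job would have precomputed and bulk-inserted (made-up numbers).
store = {
    make_key("en.wikipedia", "user", "2015-06-10"): 120,
    make_key("en.wikipedia", "spider", "2015-06-10"): 30,
    make_key("en.wikipedia", "all", "2015-06-10"): 150,
    make_key("all", "all", "2015-06-10"): 150,
}

def lookup(project="all", agent="all", day=None):
    """Return the precomputed count for one slice, or None if it was never computed."""
    return store.get(make_key(project, agent, day))
```

For example, `lookup("en.wikipedia", "user", "2015-06-10")` is one key read, while a slice nobody precomputed (say, a project that is missing from the batch output) returns `None` rather than being computed on the fly.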
> 
> 
> On Wed, Jun 10, 2015 at 1:01 AM, Dan Andreescu <[email protected]> wrote:
> 
> 
> On Tue, Jun 9, 2015 at 5:23 PM, Gabriel Wicke <[email protected]> wrote:
> On Tue, Jun 9, 2015 at 11:53 AM, Dan Andreescu <[email protected]> wrote:
> Eric, I think we should allow arbitrary querying on any dimension for that 
> first data block.  We could pre-aggregate all of those combinations pretty 
> easily since the dimensions have very low cardinality. 
> 
> Are you thinking about something like 
> /{project|all}/{agent|all}/{day}/{hour}, or will there be a lot more 
> dimensions?
> 
> only one more right now, called "agent_type".  But this is just the first 
> "cube" and we're planning a geo cube with more dimensions and are probably 
> going to try and release data split up by access method (mobile, desktop, 
> etc.) and other dimensions as people need them.  This will be tricky as we 
> try to protect privacy but that aside, the number of dimensions per endpoint, 
> right now, seems to hover around 4 or 5.
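
A rough sketch of the pre-aggregation discussed above, assuming the path layout /{project|all}/{agent|all}/{day}/{hour} from Gabriel's question. The input rows, the dimension values and the function name are hypothetical. Each raw row is added to its own cell and to the "all" rollup of every dimension that can be rolled up.

```python
from collections import Counter
from itertools import product

# Hypothetical raw rows: (project, agent, day, hour, views).
rows = [
    ("en.wikipedia", "user", "2015-06-09", 17, 100),
    ("en.wikipedia", "spider", "2015-06-09", 17, 20),
    ("de.wikipedia", "user", "2015-06-09", 17, 50),
]

def preaggregate(rows):
    """Sum views for every /{project|all}/{agent|all}/{day}/{hour} combination."""
    cube = Counter()
    for project, agent, day, hour, views in rows:
        # Each row contributes to its exact value and to the 'all' rollup
        # of both rollup dimensions: 2 * 2 = 4 keys per row.
        for p, a in product((project, "all"), (agent, "all")):
            cube[f"/{p}/{a}/{day}/{hour}"] += views
    return cube

cube = preaggregate(rows)
```

With k dimensions that allow an "all" rollup, each row writes 2^k keys, so this stays cheap at 4 or 5 low-cardinality dimensions but grows quickly as dimensions are added.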
>  
>  
> For the article-level data, no, we'd want just basic timeseries querying.
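
For the article-level case, "basic timeseries querying" could look something like the sketch below. It assumes one series per article, kept sorted by hour; the in-memory layout, the article name and the numbers are made up and do not reflect any particular store's API.

```python
from bisect import bisect_left, bisect_right

# Hypothetical per-article series, sorted by hour.
# ISO-style hour strings sort chronologically, so string comparison is enough.
series = {
    "Main_Page": [
        ("2015-06-09T16", 900),
        ("2015-06-09T17", 1000),
        ("2015-06-09T18", 1100),
    ],
}

def query(article, start, end):
    """Return the (hour, views) points with start <= hour <= end."""
    points = series.get(article, [])
    hours = [hour for hour, _ in points]
    return points[bisect_left(hours, start):bisect_right(hours, end)]
```

The only access pattern is "one article, one contiguous time range", so no cross-dimension aggregation is needed.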
> 
> Thanks Gabriel, if you could point us to an example of these secondary 
> RESTBase indices, that'd be interesting.
> 
> The API used to define these tables is described in 
> https://github.com/wikimedia/restbase/blob/master/doc/TableStorageAPI.md, 
> and the algorithm used to keep those indexes up to date is described in 
> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/doc/SecondaryIndexes.md 
> and largely implemented in 
> https://github.com/wikimedia/restbase-mod-table-cassandra/blob/master/lib/secondaryIndexes.js.
> 
> very cool, thx. 
> 
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
> 
> -- 
> Joseph Allemandou
> Data Engineer @ Wikimedia Foundation
> IRC: joal

