Thanks for all your answers. I see the curse of dimensionality, which can get really bad as the number of dimensions increases. What kinds of optimizations did you apply to reduce it? If you could name a prominent few, that would be very useful to know.
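To make the 2^N growth concrete, here is a minimal sketch that enumerates every group-by combination (cuboid) for a small set of dimensions. It assumes a full cube keeps one aggregate per subset of dimensions; the function name `cuboids` and the dimension names are only illustrative and are not taken from Kylin.

```python
from itertools import combinations

def cuboids(dimensions):
    """Return every subset of the dimensions; each subset is one group-by (cuboid)."""
    result = []
    for k in range(len(dimensions) + 1):
        result.extend(combinations(dimensions, k))
    return result

# Hypothetical dimension names, used only for illustration.
dims = ["ProductID", "Year", "Month", "WeekOfDay"]
all_cuboids = cuboids(dims)

# Four dimensions give 2 ** 4 subsets, including the empty "grand total" one.
print(len(all_cuboids))
```

Each extra dimension doubles the count, which is why pruning combinations matters so much.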
Regarding HBase: are you using the combination of dimensions as the row key? For example, /ProductID=9739/Year=2015/Month=9/WeekOfDay=Monday could be a row key holding the aggregation over all Mondays in September 2015 for product 9739. Is that the right way to think about how HBase is being used? The columns/column families could then represent different cubes.

If the underlying data store supports multi-dimensional maps, I think that would be useful. Yes, HBase is a multi-dimensional map, but those dimensions are imposed by HBase, i.e. Map<RowKey, ColumnFamily, Column, Time>, and that is limiting. Our cube can have a lot of dimensions.

I am not an expert, but from a very cursory look, Cassandra looks like a better bet. It has a query language (CQL), unlike HBase, which depends on Hive, which I hear is pretty slow. It also looks like it can support maps of maps of maps (nested tuples), which could come in handy for storing the values of a cube.

I just want to get a conceptual understanding of how Kylin works. I hope this discussion will help me get there.

Thanks,
Best,
Sarnath

On Wed, May 27, 2015 at 10:49 AM, 蒋旭 <[email protected]> wrote:

> 1. A data cube is a multi-dimensional array, which is basically a
> key-value data model. HBase is an ordered key-value store that suits
> the cube data model and query processing.
> 2. Kylin is focused on Hadoop. HBase integrates seamlessly with MR,
> HDFS, and Hive.
> 3. HBase scales out, which makes it suitable for storing large data
> sets.
> 4. HBase coprocessors provide server-side parallel processing, which
> is suitable for push-down computation and for parallelizing query
> processing.
>
> Thanks
> JiangXu
>
> ------------------ Original Message ------------------
> From: hongbin ma <[email protected]>
> Sent: 2015-05-27 12:47
> To: dev <[email protected]>
> Subject: Re: Choice of HBASE
>
> On Wed, May 27, 2015 at 12:35 PM, Sarnath <[email protected]> wrote:
>
> > Is it because Cube data can grow exponentially (2^N) with increasing
> > dimensions?
>
> This is one of the most important reasons. We applied many
> optimizations to avoid the curse of dimensionality, but the cube size
> can still grow very large, especially when distinct count appears in
> the metrics.
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
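P.S. To make my row-key question concrete, here is a minimal sketch of how I picture it. This is my assumption only, not how Kylin actually encodes its keys: `make_row_key` and `prefix_scan` are hypothetical helpers, a sorted Python list stands in for an ordered key-value store, and the aggregate values are made up.

```python
from bisect import bisect_left

def make_row_key(dims):
    """Join (name, value) pairs, in a fixed order, into one string key."""
    return "".join(f"/{name}={value}" for name, value in dims)

def prefix_scan(sorted_rows, prefix):
    """Return all (key, value) rows whose key starts with prefix.

    Because the rows are sorted by key, matching rows are contiguous,
    so we can seek to the first candidate and stop at the first miss.
    """
    keys = [key for key, _ in sorted_rows]
    start = bisect_left(keys, prefix)
    matches = []
    for key, value in sorted_rows[start:]:
        if not key.startswith(prefix):
            break
        matches.append((key, value))
    return matches

def key_for(product, year, month, day):
    # Hypothetical dimension order: product first, then time dimensions.
    return make_row_key([("ProductID", product), ("Year", year),
                         ("Month", month), ("WeekOfDay", day)])

# Made-up pre-aggregated values, kept sorted by row key.
rows = sorted([
    (key_for(9739, 2015, 9, "Monday"), 120),
    (key_for(9739, 2015, 9, "Tuesday"), 95),
    (key_for(9739, 2014, 9, "Monday"), 80),
    (key_for(1111, 2015, 9, "Monday"), 40),
])

# All rows for product 9739 in September 2015, found with one range scan.
september_2015 = prefix_scan(rows, "/ProductID=9739/Year=2015/Month=9/")
print(september_2015)
```

With this layout, the order of dimensions inside the key decides which queries become cheap range scans, which is part of what I am trying to understand.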
