Also, ordered partitioning lets HBase do row scans: I can query with a partial key and start a scan from there. But is that a requirement in Kylin? For example slicing, where some dimensions are held constant and the other dimensions are allowed to vary? That sounds like a good use case, but can someone confirm?
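To make the partial-key idea concrete, here is a minimal sketch in Python of what I mean. Everything in it is an assumption for illustration: the key layout, the names `make_key` and `prefix_scan`, and the sample numbers are hypothetical, and a sorted in-memory list stands in for an ordered key-value store. It is not Kylin's actual row key encoding.

```python
from bisect import bisect_left

# Hypothetical composite row key: dimension values joined in a fixed order.
# Zero-padding keeps lexicographic order consistent with numeric order.
def make_key(product_id, year, month, weekday):
    return f"{product_id:06d}/{year:04d}/{month:02d}/{weekday}"

# A sorted list of (row key, aggregated measure) pairs stands in for
# an ordered key-value store. The values are made up.
rows = sorted([
    (make_key(9739, 2015, 9, "Mon"), 120),
    (make_key(9739, 2015, 9, "Tue"), 95),
    (make_key(9739, 2015, 10, "Mon"), 130),
    (make_key(9739, 2014, 9, "Mon"), 80),
    (make_key(1200, 2015, 9, "Mon"), 40),
])

def prefix_scan(rows, prefix):
    """Seek to the first key >= prefix, then read until keys stop matching."""
    keys = [key for key, _ in rows]
    start = bisect_left(keys, prefix)
    matched = []
    for key, value in rows[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so nothing later can match
        matched.append((key, value))
    return matched

# Slice: ProductID, Year and Month held constant, weekday free to vary.
slice_2015_09 = prefix_scan(rows, "009739/2015/09/")
total = sum(value for _, value in slice_2015_09)
```

Note that this only works when the dimensions held constant are the leading parts of the key. If the fixed dimension sits in the middle of the key, a plain prefix scan no longer isolates the slice, and the store has to filter within a wider range.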
On Thu, May 28, 2015 at 8:46 AM, Sarnath <[email protected]> wrote:

> Thank you Andrew.... Can you tell how ordered partitioning is exploited by
> Kylin? I want to know how the Cube is exposed via HBASE's ROWKEY and Column
> Families. Can somebody explain that? Thanks much.
>
> On Thu, May 28, 2015 at 3:03 AM, Andrew Purtell <[email protected]>
> wrote:
>
>> HBase does not depend on Hive.
>>
>> If you want a CQL equivalent for HBase, you can use Apache Phoenix.
>>
>> Misunderstandings about HBase capabilities and options with respect to
>> Cassandra are common. I suspect this is because of DataStax marketing.
>> Cursory looks are often wrong.
>>
>> Given Kylin's goal to integrate well with Hadoop, an impartial assessment
>> is very likely to conclude that use of Cassandra is suboptimal. Some
>> reasons that come immediately to mind: Data stored in both HDFS and
>> Cassandra's own storage will be redundant many times over due to
>> replication in both storage systems. Cassandra lacks ordered partitioning
>> as default, which Kylin is taking advantage of, and ordered partitioning
>> in Cassandra comes with operational headaches.
>>
>>
>> On Wed, May 27, 2015 at 1:44 AM, Sarnath <[email protected]> wrote:
>>
>> > Thanks for all your answers. I see the curse of dimensions, which can
>> > get really bad when the number of dimensions increases. What kind of
>> > optimizations did you apply to reduce that? If you could name a
>> > prominent few, it will be very useful knowledge.
>> >
>> > As far as HBASE: are you using the combination of dimensions as the
>> > RowKey for HBASE? e.g.
>> > /ProductID=9739/Year=2015/Month=9/WeekOfDay=Monday can be a Row Key to
>> > show the aggregation for all Mondays in September 2015 for Product
>> > 9739.
>> >
>> > Is that a right way to think about how HBASE is being used? The
>> > columns/column families can possibly represent different cubes.
>> >
>> > If the underlying data-store supports multi-dimensional maps, I think
>> > that will be useful. Yes, HBASE is a multi-dimensional map, but those
>> > dimensions are imposed by HBASE... i.e.
>> > Map<RowKey, ColumnFamily, Column, Time>
>> > And that's limited. Our Cube can have a lot of dimensions.
>> >
>> > I am not an expert. But, from a very cursory look, Cassandra looks to
>> > be a better bet. It has a query language (CQL) (unlike HBASE which
>> > depends on Hive which I hear is pretty slow). It looks like it can
>> > support map of map of maps..... (nested tuples) which can come in handy
>> > for storing values of a Cube.
>> >
>> > I just want to get a conceptual understanding of how Kylin works. I hope
>> > this discussion will help me get there.
>> >
>> > Thanks,
>> > Best,
>> > Sarnath
>> >
>> > On Wed, May 27, 2015 at 10:49 AM, 蒋旭 <[email protected]> wrote:
>> >
>> > > 1. A data cube is a multi-dimensional array, which is basically a
>> > > key-value data model. HBase is an ordered key-value store that suits
>> > > the cube data model and query processing.
>> > > 2. Kylin is focused on Hadoop. HBase integrates seamlessly with MR,
>> > > HDFS, and Hive.
>> > > 3. HBase scales out, which makes it suitable for storing large data
>> > > sets.
>> > > 4. HBase coprocessors provide server-side parallel processing, which
>> > > suits push-down computation and parallel query processing.
>> > >
>> > > Thanks
>> > > JiangXu
>> > > ------------------ Original Message ------------------
>> > > From: hongbin ma <[email protected]>
>> > > Sent: May 27, 2015, 12:47
>> > > To: dev <[email protected]>
>> > > Subject: Re: Choice of HBASE
>> > >
>> > >
>> > >
>> > > On Wed, May 27, 2015 at 12:35 PM, Sarnath <[email protected]> wrote:
>> > >
>> > > > Is it because Cube data can grow exponentially (2^N) with increasing
>> > > > dimensions?
>> > > >
>> > >
>> > > This is one of the most important reasons. We applied many
>> > > optimizations to avoid the curse of dimensions, but the cube size can
>> > > still grow very large, especially when distinct count appears in
>> > > metrics.
>> > >
>> > >
>> > >
>> > > --
>> > > Regards,
>> > >
>> > > *Bin Mahone | 马洪宾*
>> > > Apache Kylin: http://kylin.io
>> > > Github: https://github.com/binmahone
>> > >
>> >
>>
>>
>>
>> --
>> Best regards,
>>
>> - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>
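PS: on the 2^N growth mentioned in the quoted thread above, a small Python sketch of where that number comes from, as I understand it. If every combination of dimensions gets its own pre-aggregated grouping, then N dimensions give 2^N groupings. The dimension names and the `groupings` helper are hypothetical, and this says nothing about which combinations Kylin actually materializes after its optimizations.

```python
from itertools import combinations

def groupings(dimensions):
    """Return every subset of the dimension list, including the empty one."""
    result = []
    for size in range(len(dimensions) + 1):
        result.extend(combinations(dimensions, size))
    return result

# Hypothetical dimension names, for illustration only.
dims = ["ProductID", "Year", "Month", "DayOfWeek"]
all_groupings = groupings(dims)

# One grouping per subset of dimensions: 2 ** len(dims) in total.
count = len(all_groupings)
```

Adding one more dimension doubles `count`, which is why the number of dimensions matters so much for cube size.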
