Re: Choice of HBASE

Andrew Purtell Wed, 27 May 2015 14:34:21 -0700

HBase does not depend on Hive.

If you want a CQL equivalent for HBase, you can use Apache Phoenix.


Misunderstandings about HBase capabilities and options with respect to
Cassandra are common. I suspect this is because of DataStax marketing.
Cursory looks are often wrong.

Given Kylin's goal to integrate well with Hadoop, an impartial assessment
is very likely to conclude that use of Cassandra is suboptimal. Some
reasons that come immediately to mind: Data stored in both HDFS and
Cassandra's own storage will be redundant many times over due to
replication in both storage systems. Kylin takes advantage of ordered
partitioning, which Cassandra does not use by default, and ordered
partitioning in Cassandra comes with operational headaches.
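
To illustrate the ordered partitioning point, here is a minimal Python
sketch. It is a toy model, not HBase or Cassandra client code: OrderedStore,
make_row_key, and hash_partition are hypothetical names, and the fixed-width
key layout is an assumption for illustration, not Kylin's actual row key
encoding. It shows that when keys are kept in sorted order, all rows sharing
a key prefix form one contiguous range scan, while hashing the same keys
typically spreads them across partitions.

```python
import bisect
import hashlib


def make_row_key(product_id, year, month):
    # Hypothetical layout: fixed-width, zero-padded fields so that
    # byte-wise key order matches the logical (product, year, month) order.
    return f"{product_id:08d}/{year:04d}/{month:02d}".encode()


class OrderedStore:
    """Toy stand-in for an ordered key-value store (keys kept sorted)."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, stop):
        """Yield (key, value) pairs with start <= key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


def hash_partition(key, num_partitions):
    # Toy hash partitioner: neighbouring keys usually land on different
    # partitions, so a prefix query has to touch many of them.
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


store = OrderedStore()
for month in range(1, 13):
    store.put(make_row_key(9739, 2015, month), month * 10)
store.put(make_row_key(9738, 2015, 6), 1)  # neighbouring products
store.put(make_row_key(9740, 2015, 1), 2)

# All 2015 rows for product 9739 are one contiguous range in key order.
start = make_row_key(9739, 2015, 1)
stop = make_row_key(9739, 2016, 1)
rows = list(store.scan(start, stop))
total_2015 = sum(value for _, value in rows)

# The same rows under hash partitioning: collect the partitions touched.
partitions_touched = {hash_partition(key, 8) for key, _ in rows}
print(len(rows), total_2015, sorted(partitions_touched))
```

In the ordered store the query is a single seek followed by a sequential
read. Under hash partitioning the same query would have to fan out to every
partition in partitions_touched.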


On Wed, May 27, 2015 at 1:44 AM, Sarnath <[email protected]> wrote:

> Thanks for all your answers. I see the curse of dimensions - which can get
> really bad when number of dimensions increases. What kind of optimizations
> did you apply to reduce that? If you could name a prominent few - it will
> be very useful knowledge.
>
> As for HBASE - are you using the combination of dimensions as the RowKey for
> HBASE? e.g. /ProductID=9739/Year=2015/Month=9/DayOfWeek=Monday can be a row
> key to show the aggregation for all Mondays in September 2015 for Product
> 9739.
>
> Is that a right way to think about how HBASE is being used? The
> columns/column families can possibly represent different cubes.
>
> If the underlying data-store supports multi-dimensional maps - I think that
> will be useful. Yes, HBASE is a multi-Dmap -- but those dimensions are
> imposed by HBASE... i.e. Map<RowKey, ColumnFamily, Column, Time>
> And that's limited. Our Cube can have a lot of dimensions.
>
> I am not an expert. But, from a very cursory look, Cassandra looks to be a
> better bet. It has a query language (CQL) (unlike HBASE, which depends on
> Hive, which I hear is pretty slow). It looks like it can support maps of maps
> of maps (nested tuples), which can come in handy for storing values of a Cube.
>
> I just want to get a conceptual understanding of how Kylin works. I hope
> this discussion will help me get there.
>
> Thanks,
> Best,
> Sarnath
>
> On Wed, May 27, 2015 at 10:49 AM, 蒋旭 <[email protected]> wrote:
>
> > 1. A data cube is a multi-dimensional array, which is basically a key-value
> > data model. HBase is an ordered key-value store that suits the cube data
> > model and query processing.
> > 2. Kylin is focused on Hadoop. HBase integrates seamlessly with MR, HDFS,
> > and Hive.
> > 3. HBase scales out, which makes it suitable for storing large data sets.
> > 4. HBase coprocessors provide server-side parallel processing, which is
> > suitable for pushing down computation and parallelizing query processing.
> >
> > Thanks
> > JiangXu
> > ------------------ Original Message ------------------
> > From: hongbin ma <[email protected]>
> > Sent: May 27, 2015 12:47
> > To: dev <[email protected]>
> > Subject: Re: Choice of HBASE
> >
> >
> >
> > On Wed, May 27, 2015 at 12:35 PM, Sarnath <[email protected]> wrote:
> >
> > > Is it because Cube data can grow exponentially (2^N) with increasing
> > > dimensions?
> > >
> >
> > This is one of the most important reasons. We applied many optimizations
> > to avoid the curse of dimensionality, but the cube size can still grow
> > very large, especially when distinct count appears in the metrics.
> >
> >
> >
> > --
> > Regards,
> >
> > *Bin Mahone | 马洪宾*
> > Apache Kylin: http://kylin.io
> > Github: https://github.com/binmahone
> >
>
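
On the 2^N growth mentioned in the quoted thread, a small Python sketch of
where that number comes from. It assumes a full cube with one aggregate
table (cuboid) per subset of dimensions and no pruning. The dimension names
are taken from Sarnath's row key example, and the cuboids helper is
hypothetical. It does not show any of the optimizations Kylin applies.

```python
from itertools import combinations


def cuboids(dimensions):
    """Yield every subset of the dimensions, from the empty set to all of them."""
    for size in range(len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            yield combo


dims = ["product", "year", "month", "day_of_week"]
all_cuboids = list(cuboids(dims))

# One cuboid per subset of dimensions: N dimensions give 2**N cuboids.
print(len(all_cuboids), 2 ** len(dims))
```

Each added dimension doubles the number of cuboids, which is why the cube
can grow so quickly before any optimization is applied.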



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
