Re: Choice of HBASE

Ted Dunning Wed, 27 May 2015 21:52:01 -0700

Kylin orders sub-cubes so that most scans will be very efficient.  The
exact details are unimportant (elements of a trie are assigned integer
keys), but the point is that well-ordered range scans in Kylin are
crucially important to performance.
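
To make the range-scan point concrete, here is a minimal sketch in Python.
It is not Kylin's actual key encoding: the encode() helper, the fixed-width
layout, and the aggregate values are all made up for illustration. It only
shows that when keys are kept sorted, every row sharing a key prefix sits in
one contiguous slice, so a query on the leading dimensions becomes a single
range scan.

```python
from bisect import bisect_left

# Hypothetical key layout (an assumption, not Kylin's real encoding):
# fixed-width decimal fields so that byte order matches dimension order.
def encode(product_id, year, month):
    return f"{product_id:06d}{year:04d}{month:02d}".encode("ascii")

# A toy "table" kept sorted by row key, the way an ordered store keeps it.
rows = sorted(
    (encode(p, y, m), p * 1000 + m)  # the value is a made-up aggregate
    for p in (9738, 9739, 9740)
    for y in (2014, 2015)
    for m in range(1, 13)
)
keys = [k for k, _ in rows]

def prefix_scan(prefix):
    """Return all rows whose key starts with prefix as one contiguous slice."""
    start = bisect_left(keys, prefix)
    # Keys here are ASCII digits, so prefix + 0xff sorts after every match.
    stop = bisect_left(keys, prefix + b"\xff")
    return rows[start:stop]

# All months of 2015 for product 9739: one seek, then a sequential read.
hits = prefix_scan(f"{9739:06d}{2015:04d}".encode("ascii"))
```

Here hits holds the 12 monthly rows for that product and year, found with
two binary searches rather than a pass over the whole table.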


Moreover, it is very, very common for one dimension to dominate queries in
OLAP situations.  That is enough to provide reasonable efficiency for
scans.  It is also possible to duplicate some cubes with alternative key
ordering to allow multiple scan orders, but I don't think that has been
necessary in practice.  This is hardly surprising since cubes are most
useful where query patterns are well understood.
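
A rough sketch of why key order matters, and what a duplicated cube with an
alternative ordering would buy. Everything here is assumed for illustration
(the dimension names, the "/" separator, the build() and rows_read()
helpers), and the cost model is deliberately crude: a filter on the leading
dimension reads one contiguous range, anything else is treated as a full
scan.

```python
from bisect import bisect_left
from itertools import product

# Made-up dimension values; "day" stands in for the dominant query dimension.
days = [f"2015-05-{d:02d}" for d in range(1, 29)]
products = [f"p{n:04d}" for n in range(50)]

def build(order):
    """Sorted row keys for the same cells under a given dimension order."""
    cells = [{"day": d, "product": p} for d, p in product(days, products)]
    return sorted("/".join(c[dim] for dim in order) for c in cells)

def rows_read(keys, order, dim, value):
    """Rows a scan has to read to answer the filter dim == value."""
    if order[0] == dim:
        # Leading dimension: one contiguous range of keys.
        lo = bisect_left(keys, value + "/")
        hi = bisect_left(keys, value + "0")  # "0" sorts right after "/"
        return hi - lo
    # Simplifying assumption: any other filter scans everything.
    return len(keys)

by_day = build(("day", "product"))
by_product = build(("product", "day"))  # the duplicated, re-ordered copy
```

With these toy sizes, rows_read(by_day, ("day", "product"), "day",
"2015-05-27") reads 50 rows, while a product filter against the same
ordering reads all 1400. The same product filter against by_product reads
28, at the cost of storing the cube twice.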




On Wed, May 27, 2015 at 8:32 PM, Sarnath <[email protected]> wrote:

> Also - Ordered Partitioning can help HBASE to do row-scans... i.e. I can
> query with a partial key and start a scan from there... But is that a
> requirement in Kylin? Say slicing, where some dimensions are kept
> constant and other dimensions are allowed to vary? That sounds like a good
> use case... But can someone confirm?
>
> On Thu, May 28, 2015 at 8:46 AM, Sarnath <[email protected]> wrote:
>
> > Thank you Andrew.... Can you tell how ordered partitioning is exploited
> > by Kylin? I want to know how the Cube is exposed via HBASE's ROWKEY and
> > Column Families. Can somebody explain that? Thanks much.
> >
> > On Thu, May 28, 2015 at 3:03 AM, Andrew Purtell <[email protected]>
> > wrote:
> >
> >> HBase does not depend on Hive.
> >>
> >> If you want a CQL equivalent for HBase, you can use Apache Phoenix.
> >>
> >> Misunderstandings about HBase capabilities and options with respect to
> >> Cassandra are common. I suspect this is because of DataStax marketing.
> >> Cursory looks are often wrong.
> >>
> >> Given Kylin's goal to integrate well with Hadoop, an impartial
> >> assessment is very likely to conclude that use of Cassandra is
> >> suboptimal. Some reasons that come immediately to mind: Data stored in
> >> both HDFS and Cassandra's own storage will be redundant many times over
> >> due to replication in both storage systems. Kylin takes advantage of
> >> ordered partitioning, which Cassandra lacks by default, and ordered
> >> partitioning in Cassandra comes with operational headaches.
> >>
> >>
> >> On Wed, May 27, 2015 at 1:44 AM, Sarnath <[email protected]> wrote:
> >>
> >> > Thanks for all your answers. I see the curse of dimensions - which can
> >> > get really bad when the number of dimensions increases. What kind of
> >> > optimizations did you apply to reduce that? If you could name a
> >> > prominent few - it will be very useful knowledge.
> >> >
> >> > As far as HBASE - Are you using the combination of dimensions as the
> >> > RowKey for HBASE? e.g.
> >> > /ProductID=9739/Year=2015/Month=9/WeekOfDay=Monday can be a RowKey to
> >> > show the aggregation for all Mondays in September 2015 for Product
> >> > 9739.
> >> >
> >> > Is that a right way to think about how HBASE is being used? The
> >> > columns/column families can possibly represent different cubes.
> >> >
> >> > If the underlying data-store supports multi-dimensional maps - I think
> >> > that will be useful. Yes, HBASE is a multi-dimensional map -- but those
> >> > dimensions are imposed by HBASE... i.e. Map<RowKey, ColumnFamily,
> >> > Column, Time>. And that's limited. Our Cube can have a lot of
> >> > dimensions.
> >> >
> >> > I am not an expert. But, from a very cursory look, Cassandra looks to
> >> > be a better bet. It has a query language (CQL) (unlike HBASE, which
> >> > depends on Hive, which I hear is pretty slow). It looks like it can
> >> > support maps of maps of maps..... (nested tuples), which can come in
> >> > handy for storing the values of a Cube.
> >> >
> >> > I just want to get a conceptual understanding of how Kylin works. I
> >> > hope this discussion will help me get there.
> >> >
> >> > Thanks,
> >> > Best,
> >> > Sarnath
> >> >
> >> > On Wed, May 27, 2015 at 10:49 AM, 蒋旭 <[email protected]> wrote:
> >> >
> >> > > 1. A data cube is a multi-dimensional array, which is basically a
> >> > > key-value data model. HBase is an ordered key-value store, which
> >> > > suits the cube data model and query processing.
> >> > > 2. Kylin is focused on Hadoop. HBase integrates seamlessly with MR,
> >> > > HDFS, and HIVE.
> >> > > 3. HBase scales out, which makes it suitable for storing large data
> >> > > sets.
> >> > > 4. HBase coprocessors provide server-side parallel processing, which
> >> > > is suitable for push-down computation and for parallelizing query
> >> > > processing.
> >> > >
> >> > > Thanks
> >> > > JiangXu
> >> > > ------------------ Original Message ------------------
> >> > > From: hongbin ma <[email protected]>
> >> > > Sent: May 27, 2015 12:47
> >> > > To: dev <[email protected]>
> >> > > Subject: Re: Choice of HBASE
> >> > >
> >> > >
> >> > >
> >> > > On Wed, May 27, 2015 at 12:35 PM, Sarnath <[email protected]> wrote:
> >> > >
> >> > > > Is it because Cube data can grow exponentially (2^N) with
> >> > > > increasing dimensions?
> >> > > >
> >> > >
> >> > > This is one of the most important reasons. We applied many
> >> > > optimizations to avoid the curse of dimensions, but the cube size
> >> > > can still grow very large, especially when distinct count appears in
> >> > > the metrics.
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Regards,
> >> > >
> >> > > *Bin Mahone | 马洪宾*
> >> > > Apache Kylin: http://kylin.io
> >> > > Github: https://github.com/binmahone
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Best regards,
> >>
> >>    - Andy
> >>
> >> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >> (via Tom White)
> >>
> >
> >
>
