Re: Group by + where clause

Sarnath Sun, 15 Nov 2015 03:47:55 -0800

Hi,
We have an internally developed cube engine that uses Elasticsearch as
storage. For the experiment below, ES was almost 5x to 8x faster than
Kylin (60ms vs 240ms/450ms). But we queried the ES search REST interface
directly and compared that with Kylin's REST interface, so I am not sure
how much of the gap is SQL translation overhead in Kylin. ES also did not
show the row-key-order-based fluctuations. The developer also reported
considerably lower storage usage, but I would cross-check that before I
can say anything.
Let me cross-check everything next week and see if I can publish a small
report. We only have modest infrastructure, so I don't know what the
behavior would be at scale.
On Nov 15, 2015 12:30 PM, "ShaoFeng Shi" <[email protected]> wrote:


> Yes, this is expected, and I think it is a trade-off between space
> and performance; usually we put the more frequently filtered column before
> the less frequently filtered column in the row key, for exactly this purpose.
>
> I'm not sure whether other K-V storage can provide more power here. Kylin
> has now been refactored to a plug-in architecture, which makes it possible to
> use other storage for the cube; if you have any ideas or suggestions, please
> share them with us.
>
> 2015-11-15 0:44 GMT+08:00 Sarnath <[email protected]>:
>
> > Hi ShaoFeng Shi,
> >
> > Thanks for the info. Yes, I meant Cuboid when I referred to Segment;
> > I did not know Segment is a separate term in Kylin.
> > We ran a simple experiment on this and found that this is indeed the case.
> > We created a Product,Branch cuboid and ran queries projecting
> > Product, Branch and aggregations while filtering on either Product or
> > Branch. The filter on Product consistently performed better than the
> > filter on Branch: the Branch-filtered query ran almost 1.6x slower than
> > the Product-filtered one. This was on a small synthetic dataset of
> > 10 million entries.
> >
> > Best,
> > Sarnath
> >
> >
> > On Sat, Nov 14, 2015 at 8:57 PM, ShaoFeng Shi <[email protected]>
> > wrote:
> >
> > > Kylin doesn't need a full segment scan. It only needs to scan one Cuboid
> > > (one combination of dimensions), which is a subset of a segment.
> > >
> > > If there is a "where" condition in the query, Kylin will try to narrow
> > > down the scan key range with the given values, but this depends on the
> > > order of the dimensions in the rowkey (I think you can understand it).
> > > This is why the rowkey order is so important for query performance.
> > >
> > > Besides, "where" conditions are sent to the HBase coprocessor for
> > > server-side filtering.
> > >
> > >
> > >
> > > 2015-11-13 18:36 GMT+08:00 Sarnath <[email protected]>:
> > >
> > > > Hi All,
> > > > Does Kylin perform full segment scans on certain GROUP BY queries
> > > > that also have a WHERE clause?
> > > > This, I think, is because of the HBase rowkey design. Can someone
> > > > confirm my understanding?
> > > > Best,
> > > > Sarnath
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > > Shaofeng Shi
> > >
> >
>
>
>
> --
> Best regards,
>
> Shaofeng Shi
>
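
To make the rowkey-order point in this thread concrete, here is a minimal
sketch in Python. It is not Kylin or HBase code: the sorted list of
(product, branch) tuples is only a stand-in for a Product,Branch cuboid, and
the function names are hypothetical. It assumes that a filter on the leading
dimension maps to one contiguous key range, while a filter on the trailing
dimension matches keys scattered across the whole cuboid, so every row is
read and filtered. Real engines can do better than the naive second case, so
the row counts below illustrate the direction of the effect, not the 1.6x
measured above.

```python
from bisect import bisect_left


def build_cuboid(n_products, n_branches):
    """Sorted (product, branch) keys, product first, like a Product,Branch rowkey."""
    return [(p, b) for p in range(n_products) for b in range(n_branches)]


def scan_filter_on_product(rows, product):
    """Leading dimension: binary-search one contiguous key range and read only it."""
    lo = bisect_left(rows, (product,))
    hi = bisect_left(rows, (product + 1,))
    return rows[lo:hi], hi - lo  # (matching rows, rows read)


def scan_filter_on_branch(rows, branch):
    """Trailing dimension: matches are scattered, so read every row and filter."""
    matched = [r for r in rows if r[1] == branch]
    return matched, len(rows)  # (matching rows, rows read)


if __name__ == "__main__":
    rows = build_cuboid(n_products=100, n_branches=50)
    _, read_product = scan_filter_on_product(rows, 7)
    _, read_branch = scan_filter_on_branch(rows, 7)
    print("rows read, filter on Product:", read_product)
    print("rows read, filter on Branch: ", read_branch)
```

In this toy model the Product filter reads only the rows for that product,
while the Branch filter reads the whole cuboid, which is why the more
frequently filtered column is usually placed first in the rowkey.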
