Re: RE: RE: Queries with filters and coprocessors high cpu usage

hongbin ma Sun, 23 Aug 2015 23:26:12 -0700

If you agree that the problem is covered by
https://issues.apache.org/jira/browse/KYLIN-740,
I'm working on it. Please add any further requirements to that ticket.
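
For anyone following the thread, here is a minimal sketch of the min/max scan-range behaviour Huang Hua describes further down, using his example "in (1, 2, 3, 1000000)". It is only an illustration, not Kylin's code: the function names are hypothetical, plain integers stand in for the encoded row keys, and contiguous_ranges is just one possible alternative, not what KYLIN-740 will necessarily implement.

```python
def minmax_scan_range(in_values):
    """One closed range covering every IN value (behaviour described in the thread)."""
    return min(in_values), max(in_values)


def keys_in_range(lo, hi):
    """Number of integer keys inside the closed range [lo, hi]."""
    return hi - lo + 1


def contiguous_ranges(in_values):
    """Hypothetical alternative: one closed range per run of consecutive values."""
    values = sorted(set(in_values))
    ranges = []
    start = prev = values[0]
    for v in values[1:]:
        if v != prev + 1:
            # Gap found: close the current run and start a new one.
            ranges.append((start, prev))
            start = v
        prev = v
    ranges.append((start, prev))
    return ranges


in_values = [1, 2, 3, 1000000]

lo, hi = minmax_scan_range(in_values)
print("single range:", (lo, hi), "covers", keys_in_range(lo, hi), "keys")

split = contiguous_ranges(in_values)
covered = sum(keys_in_range(a, b) for a, b in split)
print("split ranges:", split, "cover", covered, "keys")
```

With the single range, the scan covers every key between the smallest and largest IN value even though only four are wanted. Splitting into runs touches only the requested keys, at the cost of issuing more than one scan.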

On Fri, Aug 21, 2015 at 11:54 PM, vipul jhawar <[email protected]>
wrote:

> Yes, we thought about that option, but it limits us in the worst case, when
> we have to use all the dimensions. We had cubes with 4-5 dimension groups
> before. We moved to Kylin and built the multi-dimensional dashboard so that
> we could leverage 9-12 dimensions in a single cube.
>
> Thanks
>
> On Fri, Aug 21, 2015 at 8:15 PM, Huang Hua <[email protected]>
> wrote:
>
> > Adding more machines should help to some extent.
> >
> > Before considering adding more machines, is it possible to divide the
> > query dimensions into smaller groups according to business requirements?
> > If so, you can build multiple cubes, each corresponding to a group of
> > certain dimensions, and Kylin itself does a good job of auto-routing
> > queries to one of those cubes with a best-matching algorithm. More
> > importantly, with a reduced number of dimensions you can probably still
> > maintain a very responsive dashboard.
> >
> > But if the dashboard application is meant to allow queries with arbitrary
> > combinations of dimensions, then the above approach won't work.
> >
> > > -----Original Message-----
> > > From: dev-return-3769-
> > > [email protected] [mailto:dev-return-
> > > [email protected]] On Behalf Of vipul
> > > jhawar
> > > Sent: August 21, 2015 22:10
> > > To: [email protected]
> > > Subject: Re: RE: Queries with filters and coprocessors high cpu usage
> > >
> > > Just to add to Vadim's query: we want to leverage a Kylin cube with many
> > > dimensions for a very responsive dashboard, which lets users select
> > > different values among the dimensions as filters that are then set in
> > > the IN clause, so we would not want to compromise on this feature. What
> > > would be possible strategies to overcome some of these issues? Could
> > > this be solved by scaling horizontally or adding more hardware to the
> > > cluster? Any tips on sizing would be appreciated.
> > >
> > > On Fri, Aug 21, 2015 at 7:33 PM, Huang Hua <[email protected]>
> > > wrote:
> > >
> > > > I suspect the cause is most likely related to the "IN" statements.
> > > >
> > > > As far as I know, the current scan algorithm for "IN" statements uses
> > > > the minimum and maximum values from the "IN" value list to derive the
> > > > HBase scan range. In the worst case, that range can be very big. For
> > > > example, if the "IN" statement looks like "in (1, 2, 3, 1000000)",
> > > > Kylin will scan the range [1, 1000000] to get the results, which is
> > > > sometimes equivalent to a full table scan.
> > > >
> > > > I am guessing that you were generating the queries randomly, which
> > > > would probably produce "IN" statements with big ranges and give poor
> > > > performance.
> > > >
> > > > > -----Original Message-----
> > > > > From: dev-return-3767-
> > > > > [email protected]
> > > > > [mailto:dev-return-
> > > > > [email protected]] On Behalf Of Vadim
> > > > > Semenov
> > > > > Sent: August 21, 2015 21:16
> > > > > To: [email protected]
> > > > > Subject: Queries with filters and coprocessors high cpu usage
> > > > >
> > > > > Hi,
> > > > >
> > > > > I've been experimenting with Kylin for some time, and I ran into a
> > > > > difficult problem:
> > > > >
> > > > > I have a cube (total size ~150GB, ~1.1B source records) with the
> > > > > following dimensions and cardinalities (as defined in the
> > > > > aggregation group):
> > > > > date 10
> > > > > dim0 250 STRING
> > > > > dim1 60 STRING
> > > > > dim2 3000 INT
> > > > > dim3 7000 INT
> > > > > dim4 30 INT
> > > > > dim5 20 INT
> > > > > dim6 30 INT
> > > > > dim7 10 INT
> > > > >
> > > > > When I execute queries like this (accept partial = false):
> > > > >
> > > > > SELECT dim1, SUM(m0), SUM(m1), … FROM fact WHERE date BETWEEN … AND
> > > > > dim0 IN (10 values) AND
> > > > > dim2 IN (10 values)
> > > > > GROUP BY dim1 LIMIT 10;
> > > > >
> > > > > SELECT dim7, SUM(m0), SUM(m1), … FROM fact WHERE date BETWEEN … AND
> > > > > dim0 IN (10 values) AND
> > > > > dim2 IN (10 values) AND
> > > > > dim3 IN (10 values)
> > > > > GROUP BY dim7 LIMIT 10;
> > > > >
> > > > > SELECT dim7, SUM(m0), SUM(m1), … FROM fact WHERE date BETWEEN … AND
> > > > > dim0 IN (10 values) AND
> > > > > dim2 IN (10 values) AND
> > > > > dim3 IN (10 values) AND
> > > > > dim4 IN (10 values) AND
> > > > > dim6 IN (10 values)
> > > > > GROUP BY dim7 LIMIT 10;
> > > > >
> > > > >
> > > > > Coprocessors consume 100% CPU on some of the region servers and
> > > > > never finish.
> > > > > I tried to profile a region server and got the following:
> > > > > http://i.imgur.com/yrKnDc1.png
> > > > >
> > > > > I tried disabling the fuzzy key feature using backdoorToggles and got
> > > > > much better results: coprocessors don't get stuck anymore and I
> > > > > always get a response. Response time suffered a bit, but overall
> > > > > responsiveness is much better.
> > > > >
> > > > > Query times I get for the queries (accept partial = false):
> > > > > 1. 5-10 seconds
> > > > > 2. 30-100 seconds
> > > > > 3. 180-300 seconds
> > > > >
> > > > > So my questions are:
> > > > > 1. Are there ways to improve query time for this kind of query?
> > > > > 2. Why do coprocessors consume 100% CPU and never finish when fuzzy
> > > > > key is enabled?
> > > > >
> > > > > Thanks.
> > > >
> > > >
> > > >
> >
> >
> >
>



-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
