If you agree that the problem is covered by the issue
https://issues.apache.org/jira/browse/KYLIN-740, I'm working on it; please add
any further requirements to that ticket.

On Fri, Aug 21, 2015 at 11:54 PM, vipul jhawar <[email protected]> wrote:

> Yup, we thought about that option, but it limits us in the case where we
> have to use all the dimensions, as that will be the worst case. We had cubes
> with 4-5 dimension groups before. The move to Kylin was made, and the
> multi-dimensional dashboard was built, so that we could leverage 9-12
> dimensions in a single cube.
>
> Thanks
>
> On Fri, Aug 21, 2015 at 8:15 PM, Huang Hua <[email protected]> wrote:
>
> > Adding more machines should help to some extent.
> >
> > Before considering adding more machines, is it possible to divide the
> > query dimensions into different small groups according to business
> > requirements?
> > If so, you can build multiple cubes, each corresponding to a group of
> > certain dimensions, and Kylin itself does a good job of auto-routing
> > queries to one of those cubes by a best-matching algorithm. More
> > importantly, with a reduced number of dimensions, you can probably still
> > maintain a very responsive dashboard.
> >
> > But if the dashboard application is meant to allow queries with arbitrary
> > combinations of dimensions, then the above approach won't work.
> >
> > > -----Original Message-----
> > > From: dev-return-3769-[email protected]
> > > [mailto:dev-return-[email protected]] On Behalf Of vipul jhawar
> > > Sent: August 21, 2015 22:10
> > > To: [email protected]
> > > Subject: Re: RE: Queries with filters and coprocessors high cpu usage
> > >
> > > Just to add to Vadim's query, we want to leverage a Kylin cube with many
> > > dimensions for a very responsive dashboard, which allows selecting
> > > different values among the dimensions as filters that get set in the IN
> > > clause, so we would not want to compromise on this feature. What would
> > > be the possible strategies to overcome some of these issues?
> > > Could this be solved by scaling horizontally or throwing more hardware
> > > at the cluster? Any tips on sizing would be appreciated.
> > >
> > > On Fri, Aug 21, 2015 at 7:33 PM, Huang Hua <[email protected]> wrote:
> > >
> > > > I suspect that the reason is most likely related to the "IN"
> > > > statements.
> > > >
> > > > As far as I know, the current scan algorithm for "IN" statements
> > > > uses the minimum value and the maximum value from the "IN" value
> > > > list to come up with the HBase scan range. In the worst case, such a
> > > > range can be very big. For example, let's say the "IN" statement looks
> > > > like "in (1, 2, 3, 1000000)"; Kylin will then scan the range
> > > > [1, 1000000] to get back the results, which is sometimes equivalent to
> > > > a full table scan.
> > > >
> > > > And I am guessing that you were generating the queries randomly, which
> > > > would probably produce "IN" statements with big ranges and give
> > > > not-so-good performance.
> > > >
> > > > > -----Original Message-----
> > > > > From: dev-return-3767-[email protected]
> > > > > [mailto:dev-return-[email protected]] On Behalf Of Vadim Semenov
> > > > > Sent: August 21, 2015 21:16
> > > > > To: [email protected]
> > > > > Subject: Queries with filters and coprocessors high cpu usage
> > > > >
> > > > > Hi,
> > > > >
> > > > > I've been experimenting with Kylin for some time, and I ran into a
> > > > > difficult problem:
> > > > >
> > > > > I have a cube (total size ~150GB, ~1.1B source records) with the
> > > > > following dimensions and cardinalities (as they are defined in the
> > > > > aggregation group):
> > > > > date 10
> > > > > dim0 250 STRING
> > > > > dim1 60 STRING
> > > > > dim2 3000 INT
> > > > > dim3 7000 INT
> > > > > dim4 30 INT
> > > > > dim5 20 INT
> > > > > dim6 30 INT
> > > > > dim7 10 INT
> > > > >
> > > > > When I execute queries like these (accept partial = false):
> > > > >
> > > > > SELECT dim1, SUM(m0), SUM(m1), … FROM fact WHERE date BETWEEN … AND
> > > > > dim0 IN (10 values) AND
> > > > > dim2 IN (10 values)
> > > > > GROUP BY dim1 LIMIT 10;
> > > > >
> > > > > SELECT dim7, SUM(m0), SUM(m1), … FROM fact WHERE date BETWEEN … AND
> > > > > dim0 IN (10 values) AND
> > > > > dim2 IN (10 values) AND
> > > > > dim3 IN (10 values)
> > > > > GROUP BY dim7 LIMIT 10;
> > > > >
> > > > > SELECT dim7, SUM(m0), SUM(m1), … FROM fact WHERE date BETWEEN … AND
> > > > > dim0 IN (10 values) AND
> > > > > dim2 IN (10 values) AND
> > > > > dim3 IN (10 values) AND
> > > > > dim4 IN (10 values) AND
> > > > > dim6 IN (10 values)
> > > > > GROUP BY dim7 LIMIT 10;
> > > > >
> > > > > Coprocessors consume 100% CPU on some of the region servers and
> > > > > never finish.
> > > > > I tried to profile a region server and got the following:
> > > > > http://i.imgur.com/yrKnDc1.png
> > > > >
> > > > > I tried disabling the fuzzy key feature using backdoorToggles and got
> > > > > much better results: coprocessors don't get stuck anymore and I
> > > > > always get a response. Response time suffered a bit, but overall
> > > > > responsiveness is much better.
> > > > >
> > > > > Query times I get for the queries (accept partial = false):
> > > > > 1. 5-10 seconds
> > > > > 2. 30-100 seconds
> > > > > 3. 180-300 seconds
> > > > >
> > > > > So my questions are:
> > > > > 1. Are there ways to improve query time for this kind of query?
> > > > > 2. Why do coprocessors consume 100% CPU and never finish when the
> > > > > fuzzy key is enabled?
> > > > >
> > > > > Thanks.

--
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
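
To illustrate the scan-range behaviour Huang Hua describes in the thread for "IN" filters, here is a minimal Python sketch. It assumes the dimension values are plain integers and that every key inside the range has to be visited; the function names are hypothetical and are not Kylin or HBase APIs.

```python
def min_max_scan_range(in_values):
    """Return the single inclusive (low, high) range covering every IN value."""
    return min(in_values), max(in_values)


def keys_covered(low, high):
    """Count the integer keys in the inclusive range [low, high]."""
    return high - low + 1


# The example from the thread: "in (1, 2, 3, 1000000)".
in_values = [1, 2, 3, 1000000]
low, high = min_max_scan_range(in_values)
covered = keys_covered(low, high)

print(f"scan range: [{low}, {high}]")
print(f"keys covered: {covered}, keys requested: {len(set(in_values))}")
```

Under these assumptions the range spans 1,000,000 keys to answer a filter that names only 4 values, which is why one outlier in a randomly generated "IN" list can make the scan approach a full table scan.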
