I suspect that the reason is most likely related to the "IN" statements.

As far as I know, the current scan algorithm for "IN" statements takes the 
minimum and maximum values in the "IN" list and uses them as the HBase scan 
range. In the worst case, that range can be very large. For example, if the 
"IN" statement is "in (1, 2, 3, 1000000)", Kylin will scan the range 
[1, 1000000] to get the results, which is sometimes equivalent to a full 
table scan.
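
To make the example concrete, here is a rough sketch of the behaviour described above. It is only an illustration of my understanding, not Kylin's actual code: the function names are made up, and it assumes the dimension values are integers that map directly onto row keys.

```python
def in_list_scan_range(values):
    """Collapse an IN list into a single [min, max] range (hypothetical helper)."""
    return min(values), max(values)


def scan_cost(values):
    """Return (keys covered by the range, keys actually requested).

    Assumes integer keys, so the range covers hi - lo + 1 candidate keys.
    """
    lo, hi = in_list_scan_range(values)
    keys_scanned = hi - lo + 1
    keys_wanted = len(set(values))
    return keys_scanned, keys_wanted


scanned, wanted = scan_cost([1, 2, 3, 1000000])
print(f"range covers {scanned} keys to return {wanted} values")
```

With the list from the example, the range covers a million candidate keys in order to return four values, which is why a single outlier in the "IN" list can be so expensive.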

I am also guessing that you generated the queries randomly, which would 
probably produce "IN" statements that span large ranges and therefore 
perform poorly.
 
> -----Original Message-----
> From: dev-return-3767-
> [email protected] [mailto:dev-return-
> [email protected]] On Behalf Of Vadim
> Semenov
> Sent: August 21, 2015 21:16
> To: [email protected]
> Subject: Queries with filters and coprocessors high cpu usage
> 
> Hi,
> 
> I've been experimenting with Kylin for some time, and I ran into a difficult
> problem:
> 
> I have a cube (total size ~150GB, ~1.1B source records) with the following
> dimensions and cardinalities (as defined in the aggregation group):
> date 10
> dim0 250 STRING
> dim1 60 STRING
> dim2 3000 INT
> dim3 7000 INT
> dim4 30 INT
> dim5 20 INT
> dim6 30 INT
> dim7 10 INT
> 
> When I execute queries like this (accept partial = false):
> 
> SELECT dim1, SUM(m0), SUM(m1), … FROM fact WHERE date BETWEEN …
> AND
> dim0 IN (10 values) AND
> dim2 IN (10 values)
> GROUP BY dim1 LIMIT 10;
> 
> SELECT dim7, SUM(m0), SUM(m1), … FROM fact WHERE date BETWEEN …
> AND
> dim0 IN (10 values) AND
> dim2 IN (10 values) AND
> dim3 IN (10 values)
> GROUP BY dim7 LIMIT 10;
> 
> SELECT dim7, SUM(m0), SUM(m1), … FROM fact WHERE date BETWEEN …
> AND
> dim0 IN (10 values) AND
> dim2 IN (10 values) AND
> dim3 IN (10 values) AND
> dim4 IN (10 values) AND
> dim6 IN (10 values)
> GROUP BY dim7 LIMIT 10;
> 
> 
> Coprocessors consume 100% CPU on some of the region servers and never
> finish.
> I tried to profile a region server and got the following:
> http://i.imgur.com/yrKnDc1.png
> 
> I tried disabling the fuzzy key feature using backdoorToggles and got much
> better results: coprocessors don't get stuck anymore and I always get a
> response. Response time suffered a bit, but overall responsiveness is
> much better.
> 
> Query times I get for the queries (accept partial = false):
> 1. 5-10 seconds
> 2. 30-100 seconds
> 3. 180-300 seconds
> 
> So my questions are:
> 1. Are there ways to improve query time for this kind of query?
> 2. Why do coprocessors consume 100% CPU and never finish when the fuzzy
> key is enabled?
> 
> Thanks.

