Opened JIRA for this: https://issues.apache.org/jira/browse/KYLIN-740

The root cause is currently more than 7 scan ranges will be merged into
one, cause a very big scan range.  Every value inside IN() will make a scan
range, and when there are more than 7, a big scan range is created.

For a quick dirty fix, I'm going to increase the threshold to 100. The
right solution, however, should be calculate the distances between ranges
and only merge small gaps.

Cheers
Yang

On Tue, Apr 28, 2015 at 8:59 PM, Huang Hua <[email protected]> wrote:

> Hi,
>
>
>
> We were testing the query efficiency with some sql containing 'IN' keyword
> on a relatively big table which holds 22+ Million records.
>
>
>
> It seems to us that sql with a number of items in 'IN' show very poor
> performance in terms of running time.
>
>
>
> The fact table used has three columns: themonth, id, trans_at, and we
> created the cube with 2 dimensions and 1 measure:
>
> Dimensions: themonth, id
>
> Measure: sum(trans_at)
>
>
>
> Basically our queries are all similar to the one:
>
>
>
> select themonth, id, sum(trans_at) from the_table where id IN (id1, id2,
> id3, .) group by themonth, id limit 10
>
>
>
>
> When the number of ids reaches about 30-50, the response time becomes
> considerable long in terms of minutes. By the way, we are running kylin on
> 12 nodes with 2 name nodes and 10 data nodes.
>
>
>
> Besides, we tried to break the above sql with 'IN' keyword into several
> sub-sql and merge the results of each sub-sql to get the final result like
> the following:
>
> select themonth, id, sum(trans_at) from the_table where id = id1 group by
> themonth, id limit 10
>
> select themonth, id, sum(trans_at) from the_table where id = id2 group by
> themonth, id limit 10
>
> select themonth, id, sum(trans_at) from the_table where id = id3 group by
> themonth, id limit 10
>
> .
>
>
>
> Surprisedly, the total running time of those sub-sql is much less than the
> running time of the orginal sql with 'IN' keyword.
>
>
>
> Initially I thought the backend query engine should handle 'IN' keyword in
> the similar way as the individual sql with '=' keyword, but it seems not.
>
>
>
> Can anybody provide any thoughts regarding this? Any ideas on how to tune
> the queries containing 'IN' keyword?
>
>
>
> Best Regards.
>
> Hua
>
>

Reply via email to