About the efficiency of 'IN' keyword

Huang Hua Tue, 28 Apr 2015 10:13:11 -0700

Hi,


We were testing the query efficiency with some sql containing 'IN' keyword
on a relatively big table which holds 22+ Million records.

 

It seems to us that sql with a number of items in 'IN' show very poor
performance in terms of running time.

 

The fact table used has three columns: themonth, id, trans_at, and we
created the cube with 2 dimensions and 1 measure:

Dimensions: themonth, id

Measure: sum(trans_at)

 

Basically our queries are all similar to the one:

 

select themonth, id, sum(trans_at) from the_table where id IN (id1, id2,
id3, .) group by themonth, id limit 10

 


When the number of ids reaches about 30-50, the response time becomes
considerable long in terms of minutes. By the way, we are running kylin on
12 nodes with 2 name nodes and 10 data nodes.

 

Besides, we tried to break the above sql with 'IN' keyword into several
sub-sql and merge the results of each sub-sql to get the final result like
the following:

select themonth, id, sum(trans_at) from the_table where id = id1 group by
themonth, id limit 10

select themonth, id, sum(trans_at) from the_table where id = id2 group by
themonth, id limit 10

select themonth, id, sum(trans_at) from the_table where id = id3 group by
themonth, id limit 10

.

 

Surprisedly, the total running time of those sub-sql is much less than the
running time of the orginal sql with 'IN' keyword.

 

Initially I thought the backend query engine should handle 'IN' keyword in
the similar way as the individual sql with '=' keyword, but it seems not.

 

Can anybody provide any thoughts regarding this? Any ideas on how to tune
the queries containing 'IN' keyword?

 

Best Regards. 

Hua

About the efficiency of 'IN' keyword

Reply via email to