Hi,
We were testing the query efficiency with some sql containing 'IN' keyword on a relatively big table which holds 22+ Million records. It seems to us that sql with a number of items in 'IN' show very poor performance in terms of running time. The fact table used has three columns: themonth, id, trans_at, and we created the cube with 2 dimensions and 1 measure: Dimensions: themonth, id Measure: sum(trans_at) Basically our queries are all similar to the one: select themonth, id, sum(trans_at) from the_table where id IN (id1, id2, id3, .) group by themonth, id limit 10 When the number of ids reaches about 30-50, the response time becomes considerable long in terms of minutes. By the way, we are running kylin on 12 nodes with 2 name nodes and 10 data nodes. Besides, we tried to break the above sql with 'IN' keyword into several sub-sql and merge the results of each sub-sql to get the final result like the following: select themonth, id, sum(trans_at) from the_table where id = id1 group by themonth, id limit 10 select themonth, id, sum(trans_at) from the_table where id = id2 group by themonth, id limit 10 select themonth, id, sum(trans_at) from the_table where id = id3 group by themonth, id limit 10 . Surprisedly, the total running time of those sub-sql is much less than the running time of the orginal sql with 'IN' keyword. Initially I thought the backend query engine should handle 'IN' keyword in the similar way as the individual sql with '=' keyword, but it seems not. Can anybody provide any thoughts regarding this? Any ideas on how to tune the queries containing 'IN' keyword? Best Regards. Hua
