On Mon, Mar 16, 2015 at 2:11 AM, dong wang <[email protected]> wrote:

> hi shaofeng, one more question, suppose that we have SQL like: select *
> from t1 where col1 in {A} group by..
>
> set {A} is a subset of set {C}, say {B} = {C} - {A}, then {B} is
> the difference set,  and suppose that |A| > |B|,  |A| indicates
> A's cardinality, then tests show the performance of select * from t1 where
> col1 NOT IN {B} group by.. is much better that the performance of select *
> from t1 where col1 IN {A} group by..  especially when  |A| is much greater
> than |B|, so, could anyone please have a explain?
>

​hi, dong,

Putting lots of elements in IN clause will slow performance ​since it will
generate hbase scans proportional to the count of elements.

However, if the In clause is preceded with a NOT, Kylin will ignore the
filter temporally(thus instead of thousands of hbase scan, it might
actually scan only once), and postpone the filter to a very later stage.
This leads to your observations.

To summarize, putting too many elements in IN clause is not a good idea. If
you do, preceding it with NOT might help to contribute performance.

Reply via email to