On Mon, Mar 16, 2015 at 2:11 AM, dong wang <[email protected]> wrote:
> hi shaofeng, one more question, suppose that we have SQL like: select *
> from t1 where col1 in {A} group by..
>
> set {A} is a subset of set {C}, say {B} = {C} - {A}, then {B} is
> the difference set, and suppose that |A| > |B|, |A| indicates
> A's cardinality, then tests show the performance of select * from t1 where
> col1 NOT IN {B} group by.. is much better that the performance of select *
> from t1 where col1 IN {A} group by.. especially when |A| is much greater
> than |B|, so, could anyone please have a explain?
>
hi, dong,
Putting lots of elements in IN clause will slow performance since it will
generate hbase scans proportional to the count of elements.
However, if the In clause is preceded with a NOT, Kylin will ignore the
filter temporally(thus instead of thousands of hbase scan, it might
actually scan only once), and postpone the filter to a very later stage.
This leads to your observations.
To summarize, putting too many elements in IN clause is not a good idea. If
you do, preceding it with NOT might help to contribute performance.