Thanks, hongbin, yes, it indeeded matches what I observed. and we have a
new strategy to deal with the IN and NOT IN syntax in SELECT statement.

2015-03-18 12:25 GMT+08:00 hongbin ma <[email protected]>:

> On Mon, Mar 16, 2015 at 2:11 AM, dong wang <[email protected]> wrote:
>
> > hi shaofeng, one more question, suppose that we have SQL like: select *
> > from t1 where col1 in {A} group by..
> >
> > set {A} is a subset of set {C}, say {B} = {C} - {A}, then {B} is
> > the difference set,  and suppose that |A| > |B|,  |A| indicates
> > A's cardinality, then tests show the performance of select * from t1
> where
> > col1 NOT IN {B} group by.. is much better that the performance of select
> *
> > from t1 where col1 IN {A} group by..  especially when  |A| is much
> greater
> > than |B|, so, could anyone please have a explain?
> >
>
> ​hi, dong,
>
> Putting lots of elements in IN clause will slow performance ​since it will
> generate hbase scans proportional to the count of elements.
>
> However, if the In clause is preceded with a NOT, Kylin will ignore the
> filter temporally(thus instead of thousands of hbase scan, it might
> actually scan only once), and postpone the filter to a very later stage.
> This leads to your observations.
>
> To summarize, putting too many elements in IN clause is not a good idea. If
> you do, preceding it with NOT might help to contribute performance.
>

Reply via email to