Thanks, hongbin, yes, it indeeded matches what I observed. and we have a new strategy to deal with the IN and NOT IN syntax in SELECT statement.
2015-03-18 12:25 GMT+08:00 hongbin ma <[email protected]>: > On Mon, Mar 16, 2015 at 2:11 AM, dong wang <[email protected]> wrote: > > > hi shaofeng, one more question, suppose that we have SQL like: select * > > from t1 where col1 in {A} group by.. > > > > set {A} is a subset of set {C}, say {B} = {C} - {A}, then {B} is > > the difference set, and suppose that |A| > |B|, |A| indicates > > A's cardinality, then tests show the performance of select * from t1 > where > > col1 NOT IN {B} group by.. is much better that the performance of select > * > > from t1 where col1 IN {A} group by.. especially when |A| is much > greater > > than |B|, so, could anyone please have a explain? > > > > hi, dong, > > Putting lots of elements in IN clause will slow performance since it will > generate hbase scans proportional to the count of elements. > > However, if the In clause is preceded with a NOT, Kylin will ignore the > filter temporally(thus instead of thousands of hbase scan, it might > actually scan only once), and postpone the filter to a very later stage. > This leads to your observations. > > To summarize, putting too many elements in IN clause is not a good idea. If > you do, preceding it with NOT might help to contribute performance. >
