Re: Re: [DISCUSS] New Join Type: ANTI_NOTIN

Haisheng Yuan Tue, 21 Jul 2020 13:18:24 -0700

Thanks Julian for explaining the NOT IN to everyone. I am sorry for not making 
myself clear.


For that kind of simple 2-valued logic uncorrelated NOT IN subquery, SQL Server 
and Greenplum can generate efficient plan with null aware ANTI SEMI join.

Hash ANTI SEMI Join for NOT IN 
         Join Condition: t1.a = t2.a
         ->  Table Scan on t1  
         ->  Hash 
               ->  Table Scan on t2 

ANTI_NOTIN can be interpreted as null awareness ANTI joins. As vlsi pointed 
out, Oracle supports that too.

Of course, that requires the query executor to deal specifically with this kind 
of join. But optimizer should not be the database's bottleneck, it should 
generate as good plan as possible, if the query executor can support the 
operator, that is good. If not, downstream project just disable it.

Just like "a in (1,2,3...1000)", many databases (Greenplum, Postgres, Tidb ...) 
support implicit lookup inside the Filter operator, it doesn't need to be 
expanded to OR, or an explicit inner join. The implicit lookup is more 
efficient. I think the downstream would love to adapt to support the good plan, 
if not, they can choose to disable it, or rewrite to the form they support 
during post-processing stage after best plan is generated. But we can't limit 
us not to generate this kind of plan because some downstream project might not 
support it.

On 2020/07/21 18:41:39, Julian Hyde <[email protected]> wrote: 
> I want to remind everyone how hard it is to evaluate NOT IN queries.
> Here is an example query:
> 
> sqlline> !connect
> jdbc:calcite:model=core/build/resources/test/hsqldb-model.json sa sa
> > !set outputFormat csv
> > EXPLAIN PLAN FOR SELECT * FROM dept WHERE deptno NOT IN (SELECT mgr FROM 
> > emp);
> EnumerableCalc(expr#0..6=[{inputs}], expr#7=[0], expr#8=[=($t3, $t7)],
> expr#9=[IS NULL($t6)], expr#10=[>=($t4, $t3)], expr#11=[AND($t9,
> $t10)], expr#12=[OR($t8, $t11)], proj#0..2=[{exprs}],
> $condition=[$t12])
>   EnumerableHashJoin(condition=[=($0, $5)], joinType=[left])
>     EnumerableNestedLoopJoin(condition=[true], joinType=[inner])
>       JdbcToEnumerableConverter
>         JdbcTableScan(table=[[SCOTT, DEPT]])
>       JdbcToEnumerableConverter
>         JdbcAggregate(group=[{}], c=[COUNT()], ck=[COUNT($3)])
>           JdbcTableScan(table=[[SCOTT, EMP]])
>     JdbcToEnumerableConverter
>       JdbcAggregate(group=[{0, 1}])
>         JdbcProject(MGR=[$3], i=[true])
>           JdbcTableScan(table=[[SCOTT, EMP]])
> > EXPLAIN PLAN WITHOUT IMPLEMENTATION FOR SELECT * FROM dept WHERE deptno NOT 
> > IN (SELECT mgr FROM emp);
> LogicalProject(DEPTNO=[$0], DNAME=[$1], LOC=[$2])
>   LogicalFilter(condition=[NOT(IN($0, {
> LogicalProject(MGR=[$3])
>   JdbcTableScan(table=[[SCOTT, EMP]])
> }))])
>     JdbcTableScan(table=[[SCOTT, DEPT]])
> 
> Why is the physical plan so complicated? Because 'mgr' is nullable. We
> need to account for 3 different cases:
> 
> 1. dept.deptno is NULL (and therefore 'deptno NOT IN ...' evaluates to
> UNKNOWN for every row where dept.deptno is NULL);
> 2. dept.deptno is not NULL and the sub-query returns at least one NULL
> value for mgr (and therefore 'deptno not in ...' evaluates to UNKNOWN
> for every row);
> 3. dept.deptno is not NULL and the sub-query returns only non-NULL
> values of mgr (and therefore 'deptno not in ...' evaluates to TRUE or
> FALSE for every row).
> 
> To distinguish between cases 2 and 3, the plan counts the number of
> values and the number of not-null values from the sub-query.
> 
> I say that 'NOT IN is toxic' because a single null value in the
> sub-query affects the result. The IN sub-query returns 3 values and
> relational join can only account for two - match or not match.
> 
> There are actually cases where we care about the 3 values of IN. For
> example 'SELECT *, deptno IN (SELECT mgr FROM emp) FROM dept'. But
> usually IN occurs inside WHERE, and we can safely fold UNKNOWN into
> FALSE.
> 
> It is tempting to talk about the cases where there are no NULL keys,
> or UNKNOWN can safely be folded into FALSE. But I think we should be
> talking about 3-valued IN (e.g. the scalar sub-query in the previous
> paragraph). If we can solve that, we can easily convert to a solution
> for 3-valued NOT IN.
> 
> Julian
> 
> On Mon, Jul 20, 2020 at 11:25 PM Haisheng Yuan <[email protected]> wrote:
> >
> > I think they might be orthogonal.
> > It is all about sub-query.
> >
> > On 2020/07/21 05:48:54, Danny Chan <[email protected]> wrote:
> > > If it is only constant NOT IN predicate, how difficult it is to rewrite 
> > > it into a normal composite AND predicate before entering the planning 
> > > phrase ?
> > >
> > > Best,
> > > Danny Chan
> > > 在 2020年7月21日 +0800 PM12:35，Haisheng Yuan <[email protected]>，写道：
> > > > Thanks Jinpeng for providing a good example for not in subquery.
> > > >
> > > > I 100% agree with you that correlated query won't be represented by 
> > > > ANTI_NOTIN join type, and it is not the proposal's intention. Here what 
> > > > we are discussing is not to use ANTI_NOTIN to represent all the NOT IN 
> > > > sub-queries, that is impossible. Instead, if you take a close look at 
> > > > the example query, it is a simple uncorrelated NOT IN sub-query. That 
> > > > is the target. Let's focus on that kind of query, ask ourselves this 
> > > > question: Can such a simple query be transformed into a ANTI join to 
> > > > make the plan efficient?
> > > >
> > > > Sadly no. The reality is that this kind of query is not uncommon, may 
> > > > be much more common than correlated NOT IN sub-queries.
> > > >
> > > >
> > > > Reply to Julian:
> > > > > > How about making a sub-query type (in RexSubQuery), so it is gone
> > > > > > before we reach algebra.
> > > > It will be nice to have a NOT_IN subquery type, without expanding NOT 
> > > > IN to NOT(IN....).
> > > > However, if there is no ANTI_NOTIN in the join type (without reaching 
> > > > algebra), does that mean the optimizer still can't generate efficient 
> > > > plan for simple NOT IN sub-queries?
> > > >
> > > > > > ANTI_NOTIN is a terrible name. ANTI means 'opposite' to ANTI_NOTIN 
> > > > > > is
> > > > > > the opposite of NOT IN?!
> > > > It depends how people interpret ANTI. You interpret it as "opposite", I 
> > > > interpret it as "ANTI JOIN", means "anti join for NOT IN, instead of 
> > > > NOT EXISTS". But it is just a naming issue, I am OK to change it 
> > > > whatever name that makes sense to the community, as long as it can 
> > > > convey the meaning.
> > > >
> > > > Thanks,
> > > > Haisheng
> > > >
> > > > On 2020/07/21 03:02:20, Jinpeng Wu <[email protected]> wrote:
> > > > > Hi.
> > > > >
> > > > > In some SQL engine, the query
> > > > > select * from A where c1 not in ( select c1 from B where B.c2 = A.c2);
> > > > > is transformed to a plan like
> > > > > select * from A LEFT ANTI JOIN B on A.c2 = B.c2 AND (A.c1 = B.c1 OR 
> > > > > A.c1 is
> > > > > null OR B.c1 is null);
> > > > >
> > > > > Here, the "LEFT ANTI JOIN" is nothing more than traditional 
> > > > > definition. One
> > > > > thing seems to be a problem is that A.c1 cannot be used as a join key 
> > > > > in
> > > > > the new plan. However, the problem is also there for ANTI_NOTIN, and 
> > > > > even
> > > > > other NOT-IN-SUBQUERY physical implementations.
> > > > >
> > > > > Thanks,
> > > > > Qiupeng
> > > > >
> > > > > On Tue, Jul 21, 2020 at 5:30 AM Julian Hyde <[email protected]> wrote:
> > > > >
> > > > > > How about making a sub-query type (in RexSubQuery), so it is gone
> > > > > > before we reach algebra.
> > > > > >
> > > > > > ANTI_NOTIN is a terrible name. ANTI means 'opposite' to ANTI_NOTIN 
> > > > > > is
> > > > > > the opposite of NOT IN?!
> > > > > >
> > > > > > On Mon, Jul 20, 2020 at 1:00 PM Haisheng Yuan <[email protected]> 
> > > > > > wrote:
> > > > > > >
> > > > > > > Typo:
> > > > > > > We can just add a security guard saying that it is supported.
> > > > > > > Should be
> > > > > > > We can just add a security guard saying that it is NOT supported.
> > > > > > >
> > > > > > > On 2020/07/20 19:57:34, Haisheng Yuan <[email protected]> wrote:
> > > > > > > > I am not sure I got your implication by "pollute". If you mean
> > > > > > changes, yes, it requires some changes in rules. Do we need to 
> > > > > > change
> > > > > > enumerables? Not necessary. We can just add a security guard saying 
> > > > > > that it
> > > > > > is supported. Not everyone requires the Enumerable operators to 
> > > > > > support
> > > > > > everything. More importantly, currently there is no logic or rules 
> > > > > > to
> > > > > > translate sub-query directly to SEMI/ANTI joins, let alone 
> > > > > > translating
> > > > > > directly to ANTI_NOTIN. Currently NOT IN is expanded to NOT(IN ...) 
> > > > > > before
> > > > > > entering RelNode land. That means we don't even have the chance to 
> > > > > > generate
> > > > > > the NOT IN anti join. Is that still a concern?
> > > > > > > >
> > > > > > > > Even if some day, some contributor extends Calcite's parser and
> > > > > > SubqueryRemovalRule to be able to transform NOT_IN subquery into 
> > > > > > NOT IN
> > > > > > anti join, we still have chance to disable it. Is that still a 
> > > > > > concern?
> > > > > > > >
> > > > > > > > There are many ways to play it safe.
> > > > > > > >
> > > > > > > > > Brainstorming: maybe we could consider it as a separate 
> > > > > > > > > logical
> > > > > > operator
> > > > > > > > > (with its corresponding enumerable implementation)?
> > > > > > > > It doesn't sound cool. It requires much more work. You have to
> > > > > > duplicate all the rules, metadata handler that deal with 
> > > > > > LogicalJoin, and
> > > > > > for some rule that matches Join base class, you have to check it is 
> > > > > > a
> > > > > > LogicalJoin or the logical operator for ANTI_NOTIN.
> > > > > > > >
> > > > > > > > On 2020/07/20 08:28:42, Ruben Q L <[email protected]> wrote:
> > > > > > > > > I have some concerns that this new type would "pollute" the 
> > > > > > > > > existing
> > > > > > Join
> > > > > > > > > logic, rules and enumerable implementations.
> > > > > > > > >
> > > > > > > > > Brainstorming: maybe we could consider it as a separate 
> > > > > > > > > logical
> > > > > > operator
> > > > > > > > > (with its corresponding enumerable implementation)?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Le lun. 20 juil. 2020 à 06:08, Haisheng Yuan 
> > > > > > > > > <[email protected]>
> > > > > > a
> > > > > > > > > écrit :
> > > > > > > > >
> > > > > > > > > > I agree that NOT IN is toxic, and it is error-prone.
> > > > > > > > > > But you can't prevent people writing SQL with not in 
> > > > > > > > > > sub-queries,
> > > > > > would
> > > > > > > > > > you rather let optimizer generate inefficient plan?
> > > > > > > > > >
> > > > > > > > > > - Haisheng
> > > > > > > > > >
> > > > > > > > > > ------------------------------------------------------------------
> > > > > > > > > > 发件人：Julian Hyde<[email protected]>
> > > > > > > > > > 日 期：2020年07月20日 11:56:35
> > > > > > > > > > 收件人：[email protected]<[email protected]>
> > > > > > > > > > 主 题：Re: [DISCUSS] New Join Type: ANTI_NOTIN
> > > > > > > > > >
> > > > > > > > > > Yuck!
> > > > > > > > > >
> > > > > > > > > > NOT IN is toxic. I'd rather keep it out of the algebra.
> > > > > > > > > >
> > > > > > > > > > On Sun, Jul 19, 2020 at 8:24 PM Haisheng Yuan 
> > > > > > > > > > <[email protected]>
> > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > Currently, JoinRelType.ANTI only represents NOT_EXISTS 
> > > > > > > > > > > subquery
> > > > > > (thanks
> > > > > > > > > > to Ruben for reminding).
> > > > > > > > > > > For some simple boolean context NOT_IN subquery, we can't
> > > > > > transform it
> > > > > > > > > > to ANTI join. e.g.:
> > > > > > > > > > >
> > > > > > > > > > > SELECT * FROM foo WHERE a NOT IN (SELECT b FROM bar); -- 
> > > > > > > > > > > bar.b is
> > > > > > > > > > nullable
> > > > > > > > > > >
> > > > > > > > > > > Because if there is a null value in the results of 
> > > > > > > > > > > subquery, the
> > > > > > NOT IN
> > > > > > > > > > predicate will return false, the whole query returns empty. 
> > > > > > > > > > And in
> > > > > > Calcite,
> > > > > > > > > > the plan for this kind of query is inefficient.
> > > > > > > > > > >
> > > > > > > > > > > If we have ANTI_NOTIN to represent this kind of join, we 
> > > > > > > > > > > can
> > > > > > generate
> > > > > > > > > > more efficient plan, as long as the query executor support 
> > > > > > > > > > it.
> > > > > > > > > > >
> > > > > > > > > > > Thoughts?
> > > > > > > > > > >
> > > > > > > > > > > Haisheng Yuan
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > >
>

Re: Re: [DISCUSS] New Join Type: ANTI_NOTIN

Reply via email to