Thanks Jinpeng for providing a good example for not in subquery.

I 100% agree with you that correlated query won't be represented by ANTI_NOTIN 
join type, and it is not the proposal's intention. Here what we are discussing 
is not to use ANTI_NOTIN to represent all the NOT IN sub-queries, that is 
impossible. Instead, if you take a close look at the example query, it is a 
simple uncorrelated NOT IN sub-query. That is the target. Let's focus on that 
kind of query, ask ourselves this question:  Can such a simple query be 
transformed into a ANTI join to make the plan efficient?

Sadly no. The reality is that this kind of query is not uncommon, may be much 
more common than correlated NOT IN sub-queries.


Reply to Julian:
> > How about making a sub-query type (in RexSubQuery), so it is gone
> > before we reach algebra.
It will be nice to have a NOT_IN subquery type, without expanding NOT IN to 
NOT(IN....). 
However, if there is no  ANTI_NOTIN in the join type (without reaching 
algebra), does that mean the optimizer still can't generate efficient plan for 
simple NOT IN sub-queries?

> > ANTI_NOTIN is a terrible name. ANTI means 'opposite' to ANTI_NOTIN is
> > the opposite of NOT IN?!
It depends how people interpret ANTI. You interpret it as "opposite", I 
interpret it as "ANTI JOIN", means "anti join for NOT IN, instead of NOT 
EXISTS". But it is just a naming issue, I am OK to change it whatever name that 
makes sense to the community, as long as it can convey the meaning.

Thanks,
Haisheng

On 2020/07/21 03:02:20, Jinpeng Wu <wjpabc...@gmail.com> wrote: 
> Hi.
> 
> In some SQL engine, the query
> select * from A where c1 not in ( select c1 from B where B.c2 = A.c2);
> is transformed to a plan like
> select * from A LEFT ANTI JOIN B on A.c2 = B.c2 AND (A.c1 = B.c1 OR A.c1 is
> null OR B.c1 is null);
> 
> Here, the "LEFT ANTI JOIN" is nothing more than traditional definition. One
> thing seems to be a problem is that A.c1 cannot be used as a join key in
> the new plan. However, the problem is also there for ANTI_NOTIN, and even
> other NOT-IN-SUBQUERY physical implementations.
> 
> Thanks,
> Qiupeng
> 
> On Tue, Jul 21, 2020 at 5:30 AM Julian Hyde <jh...@apache.org> wrote:
> 
> > How about making a sub-query type (in RexSubQuery), so it is gone
> > before we reach algebra.
> >
> > ANTI_NOTIN is a terrible name. ANTI means 'opposite' to ANTI_NOTIN is
> > the opposite of NOT IN?!
> >
> > On Mon, Jul 20, 2020 at 1:00 PM Haisheng Yuan <hy...@apache.org> wrote:
> > >
> > > Typo:
> > > We can just add a security guard saying that it is supported.
> > > Should be
> > > We can just add a security guard saying that it is NOT supported.
> > >
> > > On 2020/07/20 19:57:34, Haisheng Yuan <hy...@apache.org> wrote:
> > > > I am not sure I got your implication by "pollute". If you mean
> > changes, yes, it requires some changes in rules. Do we need to change
> > enumerables? Not necessary. We can just add a security guard saying that it
> > is supported. Not everyone requires the Enumerable operators to support
> > everything. More importantly, currently there is no logic or rules to
> > translate sub-query directly to SEMI/ANTI joins, let alone translating
> > directly to ANTI_NOTIN. Currently NOT IN is expanded to NOT(IN ...) before
> > entering RelNode land. That means we don't even have the chance to generate
> > the NOT IN anti join. Is that still a concern?
> > > >
> > > > Even if some day, some contributor extends Calcite's parser and
> > SubqueryRemovalRule to be able to  transform NOT_IN subquery into NOT IN
> > anti join, we still have chance to disable it. Is that still a concern?
> > > >
> > > > There are many ways to play it safe.
> > > >
> > > > > Brainstorming: maybe we could consider it as a separate logical
> > operator
> > > > > (with its corresponding enumerable implementation)?
> > > > It doesn't sound cool. It requires much more work. You have to
> > duplicate all the rules, metadata handler that deal with LogicalJoin, and
> > for some rule that matches Join base class, you have to check it is a
> > LogicalJoin or the logical operator for ANTI_NOTIN.
> > > >
> > > > On 2020/07/20 08:28:42, Ruben Q L <rube...@gmail.com> wrote:
> > > > > I have some concerns that this new type would "pollute" the existing
> > Join
> > > > > logic, rules and enumerable implementations.
> > > > >
> > > > > Brainstorming: maybe we could consider it as a separate logical
> > operator
> > > > > (with its corresponding enumerable implementation)?
> > > > >
> > > > >
> > > > > Le lun. 20 juil. 2020 à 06:08, Haisheng Yuan <h.y...@alibaba-inc.com>
> > a
> > > > > écrit :
> > > > >
> > > > > > I agree that NOT IN is toxic, and it is error-prone.
> > > > > > But you can't prevent people writing SQL with not in sub-queries,
> > would
> > > > > > you rather let optimizer generate inefficient plan?
> > > > > >
> > > > > > - Haisheng
> > > > > >
> > > > > > ------------------------------------------------------------------
> > > > > > 发件人:Julian Hyde<jh...@apache.org>
> > > > > > 日 期:2020年07月20日 11:56:35
> > > > > > 收件人:dev@calcite.apache.org<dev@calcite.apache.org>
> > > > > > 主 题:Re: [DISCUSS] New Join Type: ANTI_NOTIN
> > > > > >
> > > > > > Yuck!
> > > > > >
> > > > > > NOT IN is toxic. I'd rather keep it out of the algebra.
> > > > > >
> > > > > > On Sun, Jul 19, 2020 at 8:24 PM Haisheng Yuan <hy...@apache.org>
> > wrote:
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Currently, JoinRelType.ANTI only represents NOT_EXISTS subquery
> > (thanks
> > > > > > to Ruben for reminding).
> > > > > > > For some simple boolean context NOT_IN subquery, we can't
> > transform it
> > > > > > to ANTI join. e.g.:
> > > > > > >
> > > > > > > SELECT * FROM foo WHERE a NOT IN (SELECT b FROM bar); -- bar.b is
> > > > > > nullable
> > > > > > >
> > > > > > > Because if there is a null value in the results of subquery, the
> > NOT IN
> > > > > > predicate will return false, the whole query returns empty. And in
> > Calcite,
> > > > > > the plan for this kind of query is inefficient.
> > > > > > >
> > > > > > > If we have ANTI_NOTIN to represent this kind of join, we can
> > generate
> > > > > > more efficient plan, as long as the query executor support it.
> > > > > > >
> > > > > > > Thoughts?
> > > > > > >
> > > > > > > Haisheng Yuan
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> 

Reply via email to