Actually this is a very interesting topic not only for the particular case
of NOT IN subqueries demonstrated at the beginning of this thread but for
other kinds of subqueries as well.

If I understood well the initial proposal of Haisheng is to introduce
another special kind of join at the logical level (algebra) specific to one
kind of NOT IN subqueries.
Then this would be a 1-1 translation to a physical join operator capable of
handling joins in the presence of nulls such as the null-aware joins
advocated by Oracle [1].
Undeniably there are performance benefits of using this kind of null aware
algorithms as it is also shown in the experiments of the respective paper
[1].

Other databases such as SQL server choose to model this case as regular
ANTI JOIN by adding disjunction predicates. This representation looks nice
at the logical level
but without further transformation this can only be evaluated by a nested
loop join algorithm (NLJ) so it is not that great.
Nevertheless databases with special null-aware joins could transform this
disjunctive anti-join to a physical operator that is more efficient that a
NLJ.

Postgres has also been considering NOT IN optimizations for some time now
[3] but I am not sure if it is implemented at the moment.

Another approach worth mentioning is the one implemented in HyPer [4]. The
authors introduce special join operators (namely marked join) for NOT IN,
EXISTS and other kinds of subquery clauses
at the logical (algebra) level that are later mapped to regular or
null-aware anti joins. The fact that I like in these non-standard join
operators is that they can transform complex subquery constructs to
re-orderable joins.

To sum up I agree (as everybody else I guess) that we need a join
abstraction for NOT IN and other subquery constructs in our logical
algebra.
Which one is better from those mentioned before I am not sure but I have to
say that I like the genericity in the approach from HyPer.
(For Calcite, this most likely means a new RelNode as Ruben suggested).

On the physical side things are easier since everybody has some ideas on
how the algorithm should work.
Julian gave an example and also more details can be found in the papers
mentioned previously and the web.

At the moment we should focus on the logical side of things and decide
which representation we would like to adopt.

Best,
Stamatis

[1] Enhanced Subquery optimizations in Oracle (
http://www.vldb.org/pvldb/vol2/vldb09-423.pdf)
[2] https://sqlchitchat.com/sqldev/tsql/semi-joins-in-sql-server/
[3]
https://www.postgresql.org/message-id/flat/cakjs1f82pqjqe3wt9_xremxyg20aokhc-xqkkzg_yma7jvj...@mail.gmail.com
[4] The complete story of joins (in HyPer)
https://pdfs.semanticscholar.org/ff3b/b9c6f3c71e6c10f12e45f600af72c2cd57eb.pdf




On Sat, Jul 25, 2020 at 8:05 AM Haisheng Yuan <[email protected]> wrote:

> I am not sure I get your idea.
> What will the logical plan and physical plan look like for the following
> query?
> SELECT * FROM foo WHERE a NOT IN (SELECT b FROM bar); -- bar.b is nullable
>
>
> On 2020/07/23 01:35:44, Julian Hyde <[email protected]> wrote:
> > How about a semi-join algorithm that adds column that hold the nature of
> the match? This algorithm can be used to evaluate 3-valued IN and 3-valued
> NOT IN queries.
> >
> > Generalizing further, it could compute any “ANY (predicate)” or “ALL
> (predicate)” condition as a column with values TRUE/FALSE/UNKNOWN. For
> instance “empno NOT IN …” is equivalent to “ALL(empno <> …)”.
> >
> > Here is an example.
> >
> > Emp table:
> >
> > empno deptno  mgr
> > ===== ====== ====
> >   100     10 NULL
> >   101     10  100
> >   102     20  101
> >   103     30  100
> >   104     30  103
> >   105     40 NULL
> >
> >
> > The semi-join algorithm calculates the following intermediate result SJ:
> >
> > empno deptno mgrs        match anyNulls allNulls
> > ===== ====== =========== ===== ======== ========
> >   100     10 [NULL, 100] TRUE  TRUE     FALSE
> >   101     10 [NULL, 100] FALSE TRUE     FALSE
> >   102     20 [101]       FALSE FALSE    FALSE
> >   103     30 [100, 103]  TRUE  FALSE    FALSE
> >   104     30 [100, 103]  FALSE FALSE    TRUE
> >   105     40 [NULL]      FALSE TRUE     TRUE
> >
> > So it can answer the following query:
> >
> > SELECT empno,
> >   deptno,
> >   empno IN (SELECT mgr FROM Emp
> >             WHERE deptno = e.deptno) AS “in",
> >   empno NOT IN (SELECT mgr FROM Emp
> >             WHERE deptno = e.deptno) AS “notIn"
> > FROM Emp AS e;
> >
> > empno deptno in      notIn
> > ===== ====== ======= =====
> >   100     10 TRUE    FALSE
> >   101     10 FALSE   UNKNOWN
> >   102     20 FALSE   TRUE
> >   103     30 TRUE    FALSE
> >   104     40 FALSE   FALSE
> >   105     50 UNKNOWN UNKNOWN
> >
> > by mapping it to the output of the semi-join algorithm:
> >
> > SELECT empno, deptno,
> >   case when allNulls then unknown else match end AS “in”,
> >   case when anyNulls then unknown else not match end AS “notIn”
> > FROM SJ
> >
> >
> >
> >
> >
> >
> >
> >
> > > On Jul 22, 2020, at 8:57 AM, Vladimir Sitnikov <
> [email protected]> wrote:
> > >
> > > Julian> Vladimir said he *expected* Oracle would implement (3-valued)
> NOT
> > > IN efficiently. (Back in the day, when I was at Oracle, they certainly
> did
> > > not.) Does anyone have any evidence that they do?
> > >
> > > Well, Oracle has "null aware" joins since Oracle 11g which is more
> than 10
> > > years old.
> > > I have not tested the actual performance of the `null-aware join`,
> however,
> > > it would be extremely surprising, if `null-aware join` was less
> efficient
> > > than "manually crafted aggregate + join + join + whatever".
> > >
> > > That is why I think it is important to be able to generate regular
> "not in"
> > > plans.
> > >
> > > Vladimir
> >
> >
>

Reply via email to