Actually this is a very interesting topic not only for the particular case of NOT IN subqueries demonstrated at the beginning of this thread but for other kinds of subqueries as well.
If I understood well the initial proposal of Haisheng is to introduce another special kind of join at the logical level (algebra) specific to one kind of NOT IN subqueries. Then this would be a 1-1 translation to a physical join operator capable of handling joins in the presence of nulls such as the null-aware joins advocated by Oracle [1]. Undeniably there are performance benefits of using this kind of null aware algorithms as it is also shown in the experiments of the respective paper [1]. Other databases such as SQL server choose to model this case as regular ANTI JOIN by adding disjunction predicates. This representation looks nice at the logical level but without further transformation this can only be evaluated by a nested loop join algorithm (NLJ) so it is not that great. Nevertheless databases with special null-aware joins could transform this disjunctive anti-join to a physical operator that is more efficient that a NLJ. Postgres has also been considering NOT IN optimizations for some time now [3] but I am not sure if it is implemented at the moment. Another approach worth mentioning is the one implemented in HyPer [4]. The authors introduce special join operators (namely marked join) for NOT IN, EXISTS and other kinds of subquery clauses at the logical (algebra) level that are later mapped to regular or null-aware anti joins. The fact that I like in these non-standard join operators is that they can transform complex subquery constructs to re-orderable joins. To sum up I agree (as everybody else I guess) that we need a join abstraction for NOT IN and other subquery constructs in our logical algebra. Which one is better from those mentioned before I am not sure but I have to say that I like the genericity in the approach from HyPer. (For Calcite, this most likely means a new RelNode as Ruben suggested). On the physical side things are easier since everybody has some ideas on how the algorithm should work. Julian gave an example and also more details can be found in the papers mentioned previously and the web. At the moment we should focus on the logical side of things and decide which representation we would like to adopt. Best, Stamatis [1] Enhanced Subquery optimizations in Oracle ( http://www.vldb.org/pvldb/vol2/vldb09-423.pdf) [2] https://sqlchitchat.com/sqldev/tsql/semi-joins-in-sql-server/ [3] https://www.postgresql.org/message-id/flat/cakjs1f82pqjqe3wt9_xremxyg20aokhc-xqkkzg_yma7jvj...@mail.gmail.com [4] The complete story of joins (in HyPer) https://pdfs.semanticscholar.org/ff3b/b9c6f3c71e6c10f12e45f600af72c2cd57eb.pdf On Sat, Jul 25, 2020 at 8:05 AM Haisheng Yuan <[email protected]> wrote: > I am not sure I get your idea. > What will the logical plan and physical plan look like for the following > query? > SELECT * FROM foo WHERE a NOT IN (SELECT b FROM bar); -- bar.b is nullable > > > On 2020/07/23 01:35:44, Julian Hyde <[email protected]> wrote: > > How about a semi-join algorithm that adds column that hold the nature of > the match? This algorithm can be used to evaluate 3-valued IN and 3-valued > NOT IN queries. > > > > Generalizing further, it could compute any “ANY (predicate)” or “ALL > (predicate)” condition as a column with values TRUE/FALSE/UNKNOWN. For > instance “empno NOT IN …” is equivalent to “ALL(empno <> …)”. > > > > Here is an example. > > > > Emp table: > > > > empno deptno mgr > > ===== ====== ==== > > 100 10 NULL > > 101 10 100 > > 102 20 101 > > 103 30 100 > > 104 30 103 > > 105 40 NULL > > > > > > The semi-join algorithm calculates the following intermediate result SJ: > > > > empno deptno mgrs match anyNulls allNulls > > ===== ====== =========== ===== ======== ======== > > 100 10 [NULL, 100] TRUE TRUE FALSE > > 101 10 [NULL, 100] FALSE TRUE FALSE > > 102 20 [101] FALSE FALSE FALSE > > 103 30 [100, 103] TRUE FALSE FALSE > > 104 30 [100, 103] FALSE FALSE TRUE > > 105 40 [NULL] FALSE TRUE TRUE > > > > So it can answer the following query: > > > > SELECT empno, > > deptno, > > empno IN (SELECT mgr FROM Emp > > WHERE deptno = e.deptno) AS “in", > > empno NOT IN (SELECT mgr FROM Emp > > WHERE deptno = e.deptno) AS “notIn" > > FROM Emp AS e; > > > > empno deptno in notIn > > ===== ====== ======= ===== > > 100 10 TRUE FALSE > > 101 10 FALSE UNKNOWN > > 102 20 FALSE TRUE > > 103 30 TRUE FALSE > > 104 40 FALSE FALSE > > 105 50 UNKNOWN UNKNOWN > > > > by mapping it to the output of the semi-join algorithm: > > > > SELECT empno, deptno, > > case when allNulls then unknown else match end AS “in”, > > case when anyNulls then unknown else not match end AS “notIn” > > FROM SJ > > > > > > > > > > > > > > > > > > > On Jul 22, 2020, at 8:57 AM, Vladimir Sitnikov < > [email protected]> wrote: > > > > > > Julian> Vladimir said he *expected* Oracle would implement (3-valued) > NOT > > > IN efficiently. (Back in the day, when I was at Oracle, they certainly > did > > > not.) Does anyone have any evidence that they do? > > > > > > Well, Oracle has "null aware" joins since Oracle 11g which is more > than 10 > > > years old. > > > I have not tested the actual performance of the `null-aware join`, > however, > > > it would be extremely surprising, if `null-aware join` was less > efficient > > > than "manually crafted aggregate + join + join + whatever". > > > > > > That is why I think it is important to be able to generate regular > "not in" > > > plans. > > > > > > Vladimir > > > > >
