Github user mgyucht commented on a diff in the pull request:
https://github.com/apache/spark/pull/21425#discussion_r190842619
--- Diff:
sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
---
@@ -0,0 +1,123 @@
+-- Unit tests for simple NOT IN predicate subquery across a single column.
+--
+-- ``col NOT IN expr'' is quite difficult to reason about. There are many
edge cases, some of the
+-- rules are confusing to the uninitiated, and precedence and treatment of
null values is plain
+-- unintuitive. To make this simpler to understand, I've come up with a
plain English way of
+-- describing the expected behavior of this query.
+--
+-- - If the subquery is empty (i.e. returns no rows), the row should be
returned, regardless of
+-- whether the filtered columns include nulls.
+-- - If the subquery contains a result with all columns null, then the row
should not be returned.
+-- - If for all non-null filter columns there exists a row in the subquery
in which each column
+-- either
+-- 1. is equal to the corresponding filter column or
+-- 2. is null
+-- then the row should not be returned. (This includes the case where
all filter columns are
+-- null.)
+-- - Otherwise, the row should be returned.
+--
+-- Using these rules, we can come up with a set of test cases for
single-column and multi-column
+-- NOT IN test cases.
+--
+-- Test cases for single-column ``WHERE a NOT IN (SELECT c FROM r ...)'':
+-- | # | does subquery include null? | is a null? | a = c? | row with a
included in result? |
+-- | 1 | empty | | | yes
|
+-- | 2 | yes | | | no
|
+-- | 3 | no | yes | | no
|
+-- | 4 | no | no | yes | no
|
+-- | 5 | no | no | no | yes
|
+--
+-- There are also some considerations around correlated subqueries.
Correlated subqueries can
+-- cause cases 2, 3, or 4 to be reduced to case 1 by limiting the number
of rows returned by the
+-- subquery, so the row from the parent table should always be included in
the output.
+
+CREATE TEMPORARY VIEW m AS SELECT * FROM VALUES
+ (null, 1.0),
+ (2, 3.0),
+ (4, 5.0)
+ AS m(a, b);
+
+CREATE TEMPORARY VIEW s AS SELECT * FROM VALUES
+ (null, 1.0),
+ (2, 3.0),
+ (6, 7.0)
+ AS s(c, d);
+
+ -- Uncorrelated NOT IN Subquery test cases
--- End diff --
No, it gets bundled with the query immediately after it. This is why the
`;` is on the line after the last line of the query, so that the comment shows
up with that query rather than with the following one.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]