Github user mgyucht commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21425#discussion_r190829076
  
    --- Diff: 
sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
 ---
    @@ -0,0 +1,123 @@
    +-- Unit tests for simple NOT IN predicate subquery across a single column.
    +--
    +-- ``col NOT IN expr'' is quite difficult to reason about. There are many 
edge cases, some of the
    +-- rules are confusing to the uninitiated, and precedence and treatment of 
null values is plain
    +-- unintuitive. To make this simpler to understand, I've come up with a 
plain English way of
    +-- describing the expected behavior of this query.
    +--
    +-- - If the subquery is empty (i.e. returns no rows), the row should be 
returned, regardless of
    +--   whether the filtered columns include nulls.
    +-- - If the subquery contains a result with all columns null, then the row 
should not be returned.
    +-- - If for all non-null filter columns there exists a row in the subquery 
in which each column
    +--   either
    +--   1. is equal to the corresponding filter column or
    +--   2. is null
    +--   then the row should not be returned. (This includes the case where 
all filter columns are
    +--   null.)
    +-- - Otherwise, the row should be returned.
    +--
    +-- Using these rules, we can come up with a set of test cases for 
single-column and multi-column
    +-- NOT IN test cases.
    +--
    +-- Test cases for single-column ``WHERE a NOT IN (SELECT c FROM r ...)'':
    +-- | # | does subquery include null? | is a null? | a = c? | row with a 
included in result? |
    +-- | 1 | empty                       |            |        | yes           
                 |
    +-- | 2 | yes                         |            |        | no            
                 |
    +-- | 3 | no                          | yes        |        | no            
                 |
    +-- | 4 | no                          | no         | yes    | no            
                 |
    +-- | 5 | no                          | no         | no     | yes           
                 |
    +--
    +-- There are also some considerations around correlated subqueries. 
Correlated subqueries can
    +-- cause cases 2, 3, or 4 to be reduced to case 1 by limiting the number 
of rows returned by the
    +-- subquery, so the row from the parent table should always be included in 
the output.
    +
    +CREATE TEMPORARY VIEW m AS SELECT * FROM VALUES
    +  (null, 1.0),
    +  (2, 3.0),
    +  (4, 5.0)
    +  AS m(a, b);
    +
    +CREATE TEMPORARY VIEW s AS SELECT * FROM VALUES
    +  (null, 1.0),
    +  (2, 3.0),
    +  (6, 7.0)
    +  AS s(c, d);
    +
    +  -- Uncorrelated NOT IN Subquery test cases
    --- End diff --
    
    Indenting these comments allows them to show up in the .out files. This is 
kind of a hack based on the preprocessing done by SQLQueryTestSuite. I prefer 
this because the result of the query is only visible in the .out files, so this 
way you can compare the comments to the output side-by-side.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to