vbarua opened a new issue, #12955:
URL: https://github.com/apache/datafusion/issues/12955

   ### Describe the bug
   
   According to the [SQL 
spec](https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt), when 
returning duplicate records from INTERSECT ALL the minimum number of copies 
from either input should be returned. Specifically:
   ```
               b) If a set operator is specified, then the result of applying
                 the set operator is a table containing the following rows:
   
                 i) Let R be a row that is a duplicate of some row in T1 or of
                    some row in T2 or both. Let m be the number of duplicates
                    of R in T1 and let n be the number of duplicates of R in
                    T2, where m � 0 and n � 0.
   
               ...            
   
               iii) If ALL is specified, then
   
                    Case:
   
                    1) If UNION is specified, then the number of duplicates of
                      R that T contains is (m + n).
   
                    2) If EXCEPT is specified, then the number of duplicates of
                      R that T contains is the maximum of (m - n) and 0.
   
                    3) If INTERSECT is specified, then the number of duplicates
                      of R that T contains is the minimum of m and n.
   ```
   
   DataFusion currently returns ALL copies of duplicated records from the RHS.
   
   ### To Reproduce
   
   The following query
   ```sql
   ➜  ~ datafusion-cli
   DataFusion CLI v42.0.0
   
   > SELECT * FROM VALUES ('a'), ('b'), ('b'), ('c'), ('c'), ('c')
   INTERSECT ALL
   SELECT * FROM VALUES ('b'), ('b'), ('b'), ('c'), ('c');
   +---------+
   | column1 |
   +---------+
   | b       |
   | b       |
   | c       |
   | c       |
   | c       |
   +---------+
   ```
   
   returns 3 copies of the record `('c')` which does not match the expected 
behaviour based on the spec.
   
   Note that only 2 copies of `('b')` are returned, so this only appears to 
affect the RHS.
   
   ### Expected behavior
   
   The above query should return 2 copies of the record `('c')`
   
   ### Additional context
   
   See DB Fiddle for Postgres which showcases the expected behaviour:
   https://www.db-fiddle.com/f/ja4BG5CfyEvak5ScoBwCZr/0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to