appletreeisyellow opened a new issue, #10881:
URL: https://github.com/apache/datafusion/issues/10881

   ### Describe the bug
   
   When the matching condition is on columns and combined with `OR`, the result 
retuned by `LEFT JOIN` is similar to `INNER JOIN`, which is incorrect.
   
   `LEFT JOIN` should not reduce the number of rows returned, i.e. it should 
return all the records from the left table, and the matched records from the 
right table. When there is no match, the result is a `NULL`, i.e. display as 
empty.
   
   `INNER JOIN` reduces the number of rows, i.e. it only returns the records 
that have matching values in both tables. 
   
   ### To Reproduce
   
   1. Use DataFusion CLI:
   ```shell
   # in dir datafusion/datafusion-cli
   $ cargo build
   $ ./target/debug/datafusion-cli
   ```
   
   2. Write sample data
   ```sql
   CREATE OR REPLACE TABLE employees(emp_id INT, name VARCHAR) AS VALUES (1, 
'Alice'), (2, 'Bob'), (3, 'Carol');
   CREATE OR REPLACE TABLE department(emp_id INT, department VARCHAR) AS VALUES 
(1, 'HR'), (3, 'Engineering'), (4, 'Sales');
   ```
   3. This is how `employees` table and `department` table look like
   ```sql
   > SELECT * FROM employees;
   +--------+-------+
   | emp_id | name  |
   +--------+-------+
   | 1      | Alice |
   | 2      | Bob   |
   | 3      | Carol |
   +--------+-------+
   3 row(s) fetched.
   Elapsed 0.009 seconds.
   
   > SELECT * FROM department;
   +--------+-------------+
   | emp_id | department  |
   +--------+-------------+
   | 1      | HR          |
   | 3      | Engineering |
   | 4      | Sales       |
   +--------+-------------+
   3 row(s) fetched.
   Elapsed 0.006 seconds.
   ```
   
   4. Query
   ```sql
   > SELECT e.emp_id, e.name, d.department
   FROM employees e
   LEFT JOIN department d
   ON (e.name = 'Alice' OR e.name = 'Bob');
   +--------+-------+-------------+
   | emp_id | name  | department  |
   +--------+-------+-------------+
   | 1      | Alice | HR          |
   | 1      | Alice | Engineering |
   | 1      | Alice | Sales       |
   | 2      | Bob   | HR          |
   | 2      | Bob   | Engineering |
   | 2      | Bob   | Sales       |
   +--------+-------+-------------+ <-- should have one more row for Carol
   6 row(s) fetched.
   Elapsed 0.013 seconds.
   
   > SELECT e.emp_id, e.name, d.department
   FROM employees e
   LEFT JOIN department d
   ON (e.name = 'NotExist1' OR e.name = 'NotExist2');
   +--------+------+------------+
   | emp_id | name | department |
   +--------+------+------------+
   +--------+------+------------+ <-- should have three rows, 1 row for Alice, 
1 row for Bob, and 1 row for Carol
   0 row(s) fetched.
   Elapsed 0.014 seconds.
   
   > SELECT e.emp_id, e.name, d.department
   FROM employees e
   LEFT JOIN department d
   ON (e.name = 'Alice' OR e.name = 'NotExist');
   +--------+-------+-------------+
   | emp_id | name  | department  |
   +--------+-------+-------------+
   | 1      | Alice | HR          |
   | 1      | Alice | Engineering |
   | 1      | Alice | Sales       |
   +--------+-------+-------------+ <-- should have two more rows, 1 row for 
Bob and 1 row for Carol
   3 row(s) fetched.
   Elapsed 0.014 seconds.
   ```
   
   ### Expected behavior
   
   Postgres shows the expected results. 
   
   1. In psql cli, drop tables and create new tables with data
   ```sql
   DROP TABLE IF EXISTS employees;
   DROP TABLE IF EXISTS department;
   
   # create tables
   CREATE TABLE employees(emp_id INT, name VARCHAR);
   CREATE TABLE department(emp_id INT, dept_name VARCHAR);
   
   # write sample data
   INSERT INTO employees (emp_id, name) VALUES (1, 'Alice'), (2, 'Bob'), (3, 
'Carol');
   INSERT INTO department (emp_id, dept_name) VALUES (1, 'HR'), (3, 
'Engineering'), (4, 'Sales');
   ```
   
   2. This is how two tables look like:
   ```sql
   chunchun=# SELECT * FROM employees;
    emp_id | name
   --------+-------
         1 | Alice
         2 | Bob
         3 | Carol
   (3 rows)
   
   chunchun=# SELECT * FROM department;
    emp_id |  dept_name
   --------+-------------
         1 | HR
         3 | Engineering
         4 | Sales
   (3 rows)
   ```
   
   3. Query
   ```sql
   SELECT e.emp_id, e.name, d.department
   FROM employees AS e
   LEFT JOIN department AS d
   ON (e.name = 'Alice' OR e.name = 'Bob');
    emp_id | name  | department
   --------+-------+-------------
         1 | Alice | HR
         1 | Alice | Engineering
         1 | Alice | Sales
         2 | Bob   | HR
         2 | Bob   | Engineering
         2 | Bob   | Sales
         3 | Carol |
   (7 rows)
   
   SELECT e.emp_id, e.name, d.department
   FROM employees e
   LEFT JOIN department d
   ON (e.name = 'NotExist1' OR e.name = 'NotExist2');
    emp_id | name  | department
   --------+-------+------------
         1 | Alice |
         2 | Bob   |
         3 | Carol |
   (3 rows)
   
   SELECT e.emp_id, e.name, d.department
   FROM employees e
   LEFT JOIN department d
   ON (e.name = 'Alice' OR e.name = 'NotExist');
    emp_id | name  | department
   --------+-------+-------------
         1 | Alice | HR
         1 | Alice | Engineering
         1 | Alice | Sales
         2 | Bob   |
         3 | Carol |
   (5 rows)
   ```
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to