houqp opened a new pull request #605:
URL: https://github.com/apache/arrow-datafusion/pull/605


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and 
enhancements and this helps us generate change logs for our releases. You can 
link an issue to this PR using the GitHub syntax. For example `Closes #123` 
indicates that this PR will close issue #123.
   -->
   Follow up for 
https://github.com/apache/arrow-datafusion/pull/55#pullrequestreview-683118433.
   
   Closes #601.
   
   Also fixed a bug where `index_of_column` won't error out on duplicated 
fields.
   
    # Rationale for this change
   
   In MySQL and Postgres, Join with `On` constraints produces output with join 
column from both relations preserved. For example:
   
   ```sql
   SELECT * FROM test t1 JOIN test t2 ON t1.id = t2.id;
   ```
   produces:
   
   | id  | id  |
   | --- | --- |
   | 1   | 1   |
   | 2   | 2   |
   
   While join with `Using` constraints deduplicates the join column:
   
   ```sql
   SELECT * FROM test t1 JOIN test t2 USING (id);
   ```
   
   produces:
   
   | id  |
   | --- |
   | 1   |
   | 2   |
   
   However, in our current implementation, join column dedup is applied in all 
cases. This PR changes the behavior so it's consistent with MySQL and Postgres.
   
   Here comes the annoying part.
   
   Note that when it comes to join with `Using` constraint, users can still 
project join columns using relations from both sides. For example `SELECT 
t1.id, t2.id FROM test t1 JOIN test t2 USING (id)` produces the same output as 
`SELECT * FROM test t1 JOIN test t2 ON t1.id = t2.id`. This means for `Using` 
joins, we need to model a join column as a single shared column between both 
relations. Current `DFField` struct only allows a field/column to have a single 
qualifier, so I ended adding a new `shared_qualifiers` field to `DFField` 
struct to handle this edge-case. Our logical plan builder will be responsible 
for setting this field when building join queries with using constraints. 
During query optimization and planning, the `shared_qualifiers` field is used 
to look up column by name and qualifier.
   
   Other alternatives include changing the `qualifer` field of `DFField` to an 
option of enum to account for single qualifier and shared qualifiers. None of 
these approaches felt elegant to me. I am curious if anyone has ideas or 
suggestions on how to better implement this behavior.
   
   # What changes are included in this PR?
   
   * Expose JoinConstraints to physical plane
   * Expose JoinConstraints to ballista.proto
   * Unify JoinType enum between logical and physical planes
   * Added context execution tests to enforce semantics for `On` and `Using` 
joins
   * Refactored dfschema module to use `index_of_column_by_name` in `both 
index_of_column` and `field_with_qualified_name` methods.
   * Added `shared_qualifiers` filed to DFSchema struct
   
   # Are there any user-facing changes?
   
   Join with `On` constraints will now output joined columns from both 
relations without deduplication. Dedup is only applied to join with `Using` 
constraints.
   
   <!--
   If there are any breaking changes to public APIs, please add the `api 
change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to