houqp opened a new pull request #605: URL: https://github.com/apache/arrow-datafusion/pull/605
# Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> Follow up for https://github.com/apache/arrow-datafusion/pull/55#pullrequestreview-683118433. Closes #601. Also fixed a bug where `index_of_column` won't error out on duplicated fields. # Rationale for this change In MySQL and Postgres, Join with `On` constraints produces output with join column from both relations preserved. For example: ```sql SELECT * FROM test t1 JOIN test t2 ON t1.id = t2.id; ``` produces: | id | id | | --- | --- | | 1 | 1 | | 2 | 2 | While join with `Using` constraints deduplicates the join column: ```sql SELECT * FROM test t1 JOIN test t2 USING (id); ``` produces: | id | | --- | | 1 | | 2 | However, in our current implementation, join column dedup is applied in all cases. This PR changes the behavior so it's consistent with MySQL and Postgres. Here comes the annoying part. Note that when it comes to join with `Using` constraint, users can still project join columns using relations from both sides. For example `SELECT t1.id, t2.id FROM test t1 JOIN test t2 USING (id)` produces the same output as `SELECT * FROM test t1 JOIN test t2 ON t1.id = t2.id`. This means for `Using` joins, we need to model a join column as a single shared column between both relations. Current `DFField` struct only allows a field/column to have a single qualifier, so I ended adding a new `shared_qualifiers` field to `DFField` struct to handle this edge-case. Our logical plan builder will be responsible for setting this field when building join queries with using constraints. During query optimization and planning, the `shared_qualifiers` field is used to look up column by name and qualifier. Other alternatives include changing the `qualifer` field of `DFField` to an option of enum to account for single qualifier and shared qualifiers. None of these approaches felt elegant to me. I am curious if anyone has ideas or suggestions on how to better implement this behavior. # What changes are included in this PR? * Expose JoinConstraints to physical plane * Expose JoinConstraints to ballista.proto * Unify JoinType enum between logical and physical planes * Added context execution tests to enforce semantics for `On` and `Using` joins * Refactored dfschema module to use `index_of_column_by_name` in `both index_of_column` and `field_with_qualified_name` methods. * Added `shared_qualifiers` filed to DFSchema struct # Are there any user-facing changes? Join with `On` constraints will now output joined columns from both relations without deduplication. Dedup is only applied to join with `Using` constraints. <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
