timsaucer opened a new issue, #18537:
URL: https://github.com/apache/datafusion/issues/18537
### Describe the bug
From my reading of the documentation, a `JOIN USING` should remove the duplicate
join columns from the output.
`DataFrame` currently does not have a `join_using` function, so I wrote a simple
implementation based on the `join_on` approach. When I do so, I get duplicate
output columns.
### To Reproduce
```rust
use datafusion::common::{record_batch, Column, JoinType};
use datafusion::error::Result;
use datafusion::logical_expr::LogicalPlanBuilder;
use datafusion::prelude::{DataFrame, SessionContext};

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Both tables share the column names `a` and `b`; only `a` is a join key.
    let rb1 = record_batch!(("a", Int32, [1, 2, 3]), ("b", UInt32, [4, 5, 6]))?;
    let rb2 = record_batch!(("a", Int32, [1, 2, 3]), ("b", Utf8, ["x", "y", "z"]))?;
    let _ = ctx.register_batch("left_table", rb1)?;
    let _ = ctx.register_batch("right_table", rb2)?;

    let (state, left_plan) = ctx.table("left_table").await?.into_parts();
    let (_, right_plan) = ctx.table("right_table").await?.into_parts();

    // Build the USING join directly on the logical plan.
    let plan = LogicalPlanBuilder::new(left_plan)
        .join_using(right_plan, JoinType::Inner, vec![Column::from_name("a")])?
        .build()?;

    DataFrame::new(state, plan).show().await?;
    Ok(())
}
```
This generates the output:
```
+---+---+---+---+
| a | b | a | b |
+---+---+---+---+
| 1 | 4 | 1 | x |
| 2 | 5 | 2 | y |
| 3 | 6 | 3 | z |
+---+---+---+---+
```
### Expected behavior
From reading the documentation, I would expect the output to contain only one
column named `a`.
### Additional context
Here is a relevant section from the PostgreSQL documentation:
https://www.postgresql.org/docs/current/queries-table-expressions.html
> Furthermore, the output of JOIN USING suppresses redundant columns: there
> is no need to print both of the matched columns, since they must have equal
> values. While JOIN ON produces all columns from T1 followed by all columns from
> T2, JOIN USING produces one output column for each of the listed column pairs
> (in the listed order), followed by any remaining columns from T1, followed by
> any remaining columns from T2.
Additionally, the docstring for our `join_using` function says:
```rust
/// Apply a join with using constraint, which duplicates all join columns in output schema.
pub fn join_using(
    self,
    right: LogicalPlan,
    join_type: JoinType,
    using_keys: Vec<Column>,
) -> Result<Self>
```
It would appear the code is working as the authors of the function intended,
but it also appears that this doesn't match industry expectations.
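
For reference, the column ordering that the PostgreSQL documentation describes for `JOIN USING` can be sketched in plain Rust. This is only an illustration of the rule quoted above, not DataFusion code: `using_output_columns` is a hypothetical helper, and it works on column names alone, ignoring types and qualifiers.

```rust
/// Hypothetical helper (not a DataFusion API): returns the output column
/// names for a `JOIN USING`, following the PostgreSQL rule of one column per
/// USING key (in the listed order), then the remaining left columns, then
/// the remaining right columns.
fn using_output_columns(left: &[&str], right: &[&str], using: &[&str]) -> Vec<String> {
    // One output column for each USING key, in the order the keys are listed.
    let mut out: Vec<String> = using.iter().map(|c| c.to_string()).collect();
    // Remaining columns from the left input, skipping the join keys.
    out.extend(left.iter().filter(|c| !using.contains(*c)).map(|c| c.to_string()));
    // Remaining columns from the right input, skipping the join keys.
    out.extend(right.iter().filter(|c| !using.contains(*c)).map(|c| c.to_string()));
    out
}

fn main() {
    // Same shape as the reproduction: both inputs have `a` and `b`, joined on `a`.
    let cols = using_output_columns(&["a", "b"], &["a", "b"], &["a"]);
    println!("{:?}", cols); // prints ["a", "b", "b"]
}
```

Under that rule the reproduction would yield a single `a` column followed by the two `b` columns, instead of the four columns shown in the output above.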