DDtKey opened a new issue, #14058:
URL: https://github.com/apache/datafusion/issues/14058
### Describe the bug
Affected Version: 42.x, 43.x, 44.x (regression since 41.x)
The `DataFrame::schema` method returns a schema that includes all columns
from the joined sources, including columns not present in the final output.
This behavior is incorrect and inconsistent with the documented behavior:
> Returns the DFSchema describing the output of this DataFrame.
### To Reproduce
Simple MRE here:
```rust
// Works for datafusion: 41.x and earlier
// Failed for datafusion: 42.x and later (including 44.x)
use datafusion::arrow::util::pretty;
use datafusion::prelude::*;
#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
let ctx = SessionContext::new();
// Create table1
ctx.sql(
r#"
CREATE TABLE table1 AS
SELECT * FROM (
VALUES
(1, 'a'),
(2, 'b'),
(3, 'c')
) AS t(id, value1)
"#,
)
.await?;
// Create table2
ctx.sql(
r#"
CREATE TABLE table2 AS
SELECT * FROM (
VALUES
(1, 'x'),
(3, 'y'),
(4, 'z')
) AS t(id, value2)
"#,
)
.await?;
// Execute NATURAL JOIN query
let df = ctx.sql("SELECT * FROM table1 NATURAL JOIN table2").await?;
// Incorrect schema includes all columns from both tables
let schema = df.schema().as_arrow().clone();
println!("Schema: {:?}", schema);
// Output does not include all columns
let result = df.collect().await?;
pretty::print_batches(&result)?;
let result_schema = result.first().unwrap().schema();
assert_eq!(&schema, &*result_schema, "Schema mismatch");
Ok(())
}
```
Deps:
```toml
datafusion = "44.0.0"
tokio = { version = "1", features = ["full"] }
```
### Expected behavior
The schema returned by `DataFrame::schema` should match the structure of the
output produced by `collect`/`collect_partitioned` and etc. Specifically:
- Excluded columns from the result of a NATURAL JOIN should not appear in
the schema.
___
Or, if it was intended - the documentation should be aligned and be clear
how to access the schema.
However, I find previous behavior correct and useful (e.g - get schema
before methods like `write_parquet`/`csv`/`json`)
### Additional context
This is a regression, as the method previously **worked correctly in version
41.x.x and earlier.**
Also, it probably points to the missing test coverage for particular
code-paths. In a sense it's not enough to compare SQL execution results
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]