[I] Regression: `DataFrame::schema` returns incorrect schema for NATURAL JOIN [datafusion]

via GitHub Thu, 09 Jan 2025 07:07:35 -0800


DDtKey opened a new issue, #14058:
URL: https://github.com/apache/datafusion/issues/14058


   ### Describe the bug
   
   Affected Version: 42.x, 43.x, 44.x (regression since 41.x)
   
   The `DataFrame::schema` method returns a schema that contains every column 
from both joined sources, including columns that are absent from the final output. 
This is incorrect and inconsistent with the documented behavior:
   
   > Returns the DFSchema describing the output of this DataFrame.
   
   ### To Reproduce
   
   Simple MRE here:
   
   ```rust
   // Works with datafusion 41.x and earlier
   // Fails with datafusion 42.x and later (including 44.x)
   
   use datafusion::arrow::util::pretty;
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> datafusion::error::Result<()> {
       let ctx = SessionContext::new();
   
       // Create table1
       ctx.sql(
           r#"
           CREATE TABLE table1 AS
           SELECT * FROM (
               VALUES
               (1, 'a'),
               (2, 'b'),
               (3, 'c')
           ) AS t(id, value1)
           "#,
       )
       .await?;
   
       // Create table2
       ctx.sql(
           r#"
           CREATE TABLE table2 AS
           SELECT * FROM (
               VALUES
               (1, 'x'),
               (3, 'y'),
               (4, 'z')
           ) AS t(id, value2)
           "#,
       )
       .await?;
   
       // Execute NATURAL JOIN query
       let df = ctx.sql("SELECT * FROM table1 NATURAL JOIN table2").await?;
   
       // Incorrect schema includes all columns from both tables
       let schema = df.schema().as_arrow().clone();
       println!("Schema: {:?}", schema);
   
       // Output does not include all columns
       let result = df.collect().await?;
       pretty::print_batches(&result)?;
   
       let result_schema = result.first().unwrap().schema();
       assert_eq!(&schema, &*result_schema, "Schema mismatch");
   
       Ok(())
   }
   
   ```
   
   Deps:
   ```toml
   datafusion = "44.0.0"
   tokio = { version = "1", features = ["full"] }
   ```
   
   
   ### Expected behavior
   
   The schema returned by `DataFrame::schema` should match the structure of the 
output produced by `collect`, `collect_partitioned`, and similar methods. Specifically:
   
   - Columns excluded from the result of a NATURAL JOIN should not appear in 
the schema.
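
As a minimal sketch of the column list one would expect for the MRE above, assuming standard SQL `NATURAL JOIN` semantics (each shared column appears once, followed by the remaining left and then right columns). The helper `natural_join_columns` is hypothetical and is not part of DataFusion; it only illustrates the expected shape of the schema, not how DataFusion computes it.

```rust
/// Hypothetical helper (not a DataFusion API): computes the output column
/// names of `left NATURAL JOIN right` under standard SQL semantics.
fn natural_join_columns(left: &[&str], right: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = Vec::new();

    // Shared columns come first, each exactly once, in left-table order.
    for name in left.iter().filter(|n| right.contains(*n)) {
        out.push((*name).to_string());
    }
    // Then the columns that exist only in the left table.
    for name in left.iter().filter(|n| !right.contains(*n)) {
        out.push((*name).to_string());
    }
    // Finally the columns that exist only in the right table.
    for name in right.iter().filter(|n| !left.contains(*n)) {
        out.push((*name).to_string());
    }
    out
}

fn main() {
    // Column lists taken from table1 and table2 in the MRE.
    let cols = natural_join_columns(&["id", "value1"], &["id", "value2"]);
    println!("{:?}", cols); // prints ["id", "value1", "value2"]
}
```

Under that assumption the schema would have three fields, with `id` appearing only once, rather than the four columns of both inputs.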
   
   ___
   Alternatively, if this change was intentional, the documentation should be 
updated to make clear how to obtain the output schema.
   However, I find the previous behavior correct and useful (e.g. to get the 
schema before calling methods like `write_parquet`/`write_csv`/`write_json`).
   
   ### Additional context
   
   This is a regression: the method **worked correctly in version 41.x and 
earlier.** 
   
   It also probably points to missing test coverage for particular code 
paths. In that sense, comparing SQL execution results alone is not enough.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

