[GitHub] [arrow-datafusion] sergiimk opened a new issue, #6463: UNION ALL schema harmonization failure in subquery/view

via GitHub Fri, 26 May 2023 18:29:05 -0700


sergiimk opened a new issue, #6463:
URL: https://github.com/apache/arrow-datafusion/issues/6463


   ### Describe the bug
   
   A UNION ALL of tables sourced from two Parquet files whose columns differ in nullability results in errors like:
   ```rust
   ParquetError(ArrowError("Record batch schema does not match writer schema"))
   ```
   and
   ```rust
   ArrowError(InvalidArgumentError("batches[0] schema is different with argument schema. ..."))
   ```
   
   ### To Reproduce
   
   I have two parquet files ([data.tar.gz](https://github.com/apache/arrow-datafusion/files/11580673/data.tar.gz)) with schemas that only differ in nullability:
   
   alberta.parquet:
   ```
   message arrow_schema {
     required int64 offset;
     required int64 system_time (TIMESTAMP(NANOS,false));
     optional int64 reported_date (TIMESTAMP(NANOS,false));
     optional int64 id;
     required binary gender (STRING);
     required binary age_group (STRING);
     optional binary location (STRING);
   }
   ```
   british-columbia.parquet:
   ```
   message arrow_schema {
     required int64 offset;
     required int64 system_time (TIMESTAMP(NANOS,false));
     optional int64 reported_date (TIMESTAMP(NANOS,false));
     optional int64 id;
     optional binary gender (STRING);
     required binary age_group (STRING);
     optional binary location (STRING);
   }
   ```
   
   I run the following program:
   ```rust
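   use datafusion::prelude::*;

   #[tokio::main]
   async fn main() {
       let ctx = SessionContext::new();

       // Register the two files as tables `ab` and `bc`. The paths are an
       // assumption: they expect data.tar.gz extracted into the working directory.
       ctx.register_parquet("ab", "alberta.parquet", ParquetReadOptions::default())
           .await
           .unwrap();
       ctx.register_parquet("bc", "british-columbia.parquet", ParquetReadOptions::default())
           .await
           .unwrap();

       let df = ctx
           .sql(
               r#"
               SELECT
                   'AB' as province,
                   id,
                   reported_date,
                   gender,
                   location
               FROM ab
               UNION ALL
               SELECT
                   'BC' as province,
                   id,
                   reported_date,
                   gender,
                   location
               FROM bc
               "#,
           )
           .await
           .unwrap();

       // Print the DataFrame schema, then preview the first rows.
       println!("{:#?}", df.schema());
       df.clone().show_limit(10).await.unwrap();

       // Write the result to a temporary directory as Parquet.
       let tempdir = tempfile::tempdir().unwrap();
       df.write_parquet(&format!("{}/foo", tempdir.path().display()), None)
           .await
           .unwrap();
   }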
   ```
   and it **succeeds**!
   
   If I then modify it to use a **subquery**:
   ```sql
   SELECT * FROM (
       SELECT
           'AB' as province,
           id,
           reported_date,
           gender,
           location
       FROM ab
       UNION ALL
       SELECT
           'BC' as province,
           id,
           reported_date,
           gender,
           location
       FROM bc
   )
   ```
   I get the error:
   ```rust
   ParquetError(ArrowError("Record batch schema does not match writer schema"))
   ```
   Errors also occur when the query is wrapped in a `CREATE VIEW`.
   
   Note that the DataFrame schema correctly harmonizes the `gender` field to `nullable: true` - the error happens only at the stage of writing the output to a parquet file.
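   
   To make the expectation concrete, here is a minimal, dependency-free sketch of the harmonization rule and of why a batch that keeps its input's nullability would be rejected. It is only a model: `Field`, `field`, `harmonize` and `writer_accepts` are hypothetical names, not DataFusion or Arrow APIs, and the strict equality check is an assumption inferred from the error message.
   
   ```rust
   // Simplified stand-in for an Arrow field: only name and nullability.
   #[derive(Debug, Clone, PartialEq)]
   struct Field {
       name: String,
       nullable: bool,
   }
   
   fn field(name: &str, nullable: bool) -> Field {
       Field { name: name.to_string(), nullable }
   }
   
   /// Harmonize two UNION ALL inputs: a column is nullable in the output
   /// if it is nullable in either input. Returns `None` when the column
   /// counts or names differ.
   fn harmonize(left: &[Field], right: &[Field]) -> Option<Vec<Field>> {
       if left.len() != right.len() {
           return None;
       }
       left.iter()
           .zip(right)
           .map(|(l, r)| (l.name == r.name).then(|| field(&l.name, l.nullable || r.nullable)))
           .collect()
   }
   
   /// Hypothetical strict writer check (an assumption): the batch schema
   /// must equal the writer schema, nullability included.
   fn writer_accepts(writer_schema: &[Field], batch_schema: &[Field]) -> bool {
       writer_schema == batch_schema
   }
   
   fn main() {
       // `gender` is required in alberta.parquet, optional in british-columbia.parquet.
       let ab = vec![field("gender", false), field("location", true)];
       let bc = vec![field("gender", true), field("location", true)];
   
       let unified = harmonize(&ab, &bc).expect("column names match");
       println!("{:?}", unified);
   
       // A batch still carrying the AB input's schema fails the strict check.
       println!("AB batch accepted: {}", writer_accepts(&unified, &ab));
       println!("BC batch accepted: {}", writer_accepts(&unified, &bc));
   }
   ```
   
   In this model the harmonized schema marks `gender` as nullable, so only batches that were re-labelled with the harmonized schema pass the check, which matches the symptom that the failure appears only at write time.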
   
   ### Expected behavior
   
   The Parquet file is written successfully with the harmonized schema.
   
   ### Additional context
   
   _No response_


