alamb opened a new issue, #18337:
URL: https://github.com/apache/datafusion/issues/18337

   ### Describe the bug
   
   When running queries against parquet files that have field metadata, without
   stripping that metadata, DataFusion fails with the internal error shown below.
   
   
   ### To Reproduce
   
   
   Repro
   ```sql
   -- First, ensure that parquet metadata is not skipped (it is skipped by default)
   set datafusion.execution.parquet.skip_metadata = false;
   
   SELECT
     'foo' AS name,
     COUNT(
       CASE
         WHEN prev_value = false AND value = TRUE THEN 1
         ELSE NULL
         END
        ) AS count_true_rises
   FROM
     (
       SELECT
         value,
         LAG(value) OVER (ORDER BY time ) AS prev_value
       FROM
         'repro.parquet'
   );
   ```
   
   Results in
   ```
   Internal error: Physical input schema should be the same as the one converted from logical input schema. Differences: .
   This issue was likely caused by a bug in DataFusion's code. Please help us to resolve this by filing a bug report in our issue tracker: https://github.com/apache/datafusion/issues
   ```
   
   
   I made the parquet file available here: 
   
   
[parquet-with-metadata.zip](https://github.com/user-attachments/files/23193734/parquet-with-metadata.zip)
   
   Here is the code to generate the parquet file (I am not sure how to create
   parquet files with metadata otherwise):
   
   <details><summary>Details</summary>
   <p>
   
   
   ```rust
   use std::collections::HashMap;
   use std::fs::File;
   use std::sync::Arc;
   use arrow::array::{BooleanArray, RecordBatch, TimestampNanosecondArray};
   use arrow::datatypes::{DataType, Field, Schema, SchemaRef, TimeUnit};
   
   #[tokio::main]
   async fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Write a parquet file whose `value` field carries metadata
       let mut metadata = HashMap::new();
       metadata.insert(String::from("year"), String::from("2015"));
       let schema: SchemaRef = Arc::new(Schema::new(vec![
           Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
           Field::new("value", DataType::Boolean, false)
               .with_metadata(metadata),
       ]));
   
       let time = TimestampNanosecondArray::from(vec![
           1_420_070_400_000_000_000i64,
           1_420_070_401_000_000_000i64,
       ]);
       let value = BooleanArray::from(vec![true, false]);
       let batch = RecordBatch::try_new(schema.clone(), vec![
           Arc::new(time),
           Arc::new(value),
       ])?;
   
   
       println!("Writing parquet file with metadata repro.parquet...");
       let writer = File::create("repro.parquet")?;
       let mut arrow_writer = parquet::arrow::ArrowWriter::try_new(
           writer,
           schema.clone(),
           None,
       )?;
       arrow_writer.write(&batch)?;
       arrow_writer.close()?;
   
       Ok(())
   }
   ```
   
   
   </p>
   </details> 
   
   
   Note this is all the more confusing because the error lists no differences
   ```
   ...  converted from logical input schema. Differences: . <-- no differences are listed!!!
   ```
   
   The difference is the metadata on the `value` field.
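   
   A minimal sketch of how such an empty message could arise, assuming (this is
   not confirmed against DataFusion's source) that the equality check compares
   field metadata while the code that lists differences only looks at name,
   type, and nullability. `ModelField`, `value_field`, and
   `describe_differences` are hypothetical stand-ins written for illustration;
   they are not Arrow or DataFusion APIs.
   
   ```rust
   use std::collections::HashMap;
   
   /// Hypothetical stand-in for an Arrow field (not `arrow::datatypes::Field`).
   #[derive(Debug, Clone, PartialEq)]
   struct ModelField {
       name: String,
       data_type: String,
       nullable: bool,
       metadata: HashMap<String, String>,
   }
   
   /// Build the `value` field from the repro with the given metadata.
   fn value_field(metadata: HashMap<String, String>) -> ModelField {
       ModelField {
           name: String::from("value"),
           data_type: String::from("Boolean"),
           nullable: false,
           metadata,
       }
   }
   
   /// List differences using only name, type, and nullability.
   /// Metadata is deliberately ignored to model a diff that omits it.
   fn describe_differences(a: &ModelField, b: &ModelField) -> Vec<String> {
       let mut diffs = Vec::new();
       if a.name != b.name {
           diffs.push(format!("name: {} vs {}", a.name, b.name));
       }
       if a.data_type != b.data_type {
           diffs.push(format!("type: {} vs {}", a.data_type, b.data_type));
       }
       if a.nullable != b.nullable {
           diffs.push(format!("nullable: {} vs {}", a.nullable, b.nullable));
       }
       diffs
   }
   
   fn main() {
       let mut metadata = HashMap::new();
       metadata.insert(String::from("year"), String::from("2015"));
       let with_metadata = value_field(metadata);
       let without_metadata = value_field(HashMap::new());
   
       // The derived equality includes metadata, so the two fields are unequal...
       println!("equal: {}", with_metadata == without_metadata);
       // ...but the diff skips metadata, so nothing is listed after "Differences:".
       let diffs = describe_differences(&with_metadata, &without_metadata);
       println!("Differences: {}.", diffs.join(", "));
   }
   ```
   
   In this model the two fields compare as unequal while the difference list
   stays empty, which matches the shape of the message above.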
   
   
   ### Expected behavior
   
   I expect the query to run without error.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

