alamb opened a new issue, #18337:
URL: https://github.com/apache/datafusion/issues/18337
### Describe the bug
When querying Parquet files that have field metadata, without stripping that
metadata, DataFusion fails with the internal error shown below.
### To Reproduce
Repro
```sql
-- First, ensure that parquet metadata is not skipped (it is skipped by default)
set datafusion.execution.parquet.skip_metadata = false;

SELECT
  'foo' AS name,
  COUNT(
    CASE
      WHEN prev_value = false AND value = TRUE THEN 1
      ELSE NULL
    END
  ) AS count_true_rises
FROM
  (
    SELECT
      value,
      LAG(value) OVER (ORDER BY time) AS prev_value
    FROM
      'repro.parquet'
  );
```
Results in
```
Internal error: Physical input schema should be the same as the one
converted from logical input schema. Differences: .
This issue was likely caused by a bug in DataFusion's code. Please help us
to resolve this by filing a bug report in our issue tracker:
https://github.com/apache/datafusion/issues
```
I made the parquet file available here:
[parquet-with-metadata.zip](https://github.com/user-attachments/files/23193734/parquet-with-metadata.zip)
Here is the code to generate the parquet file (I am not sure how to create
parquet files with metadata otherwise):
<details><summary>Details</summary>
<p>
```rust
use std::collections::HashMap;
use std::fs::File;
use std::sync::Arc;

use arrow::array::{BooleanArray, RecordBatch, TimestampNanosecondArray};
use arrow::datatypes::{DataType, Field, Schema, SchemaRef, TimeUnit};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write a parquet file whose `value` field carries metadata
    let mut metadata = HashMap::new();
    metadata.insert(String::from("year"), String::from("2015"));

    let schema: SchemaRef = Arc::new(Schema::new(vec![
        Field::new("time", DataType::Timestamp(TimeUnit::Nanosecond, None), false),
        Field::new("value", DataType::Boolean, false).with_metadata(metadata),
    ]));

    let time = TimestampNanosecondArray::from(vec![
        1_420_070_400_000_000_000i64,
        1_420_070_401_000_000_000i64,
    ]);
    let value = BooleanArray::from(vec![true, false]);

    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(time), Arc::new(value)],
    )?;

    println!("Writing parquet file with metadata repro.parquet...");
    let writer = File::create("repro.parquet")?;
    let mut arrow_writer = parquet::arrow::ArrowWriter::try_new(
        writer,
        schema.clone(),
        None,
    )?;
    arrow_writer.write(&batch)?;
    arrow_writer.close()?;

    Ok(())
}
```
</p>
</details>
Note that this is all the more confusing because the error lists no differences:
```
... converted from logical input schema. Differences: .   <-- no differences are listed!!!
```
The difference is the metadata on the `value` field.
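
To illustrate how an equality check can fail while the reported difference list stays empty, here is a minimal std-only sketch. `FieldModel`, `value_field`, and `describe_differences` are hypothetical stand-ins invented for this example; they are not DataFusion's or Arrow's actual types or code. The sketch assumes a diff routine that compares name, type, and nullability but ignores metadata, which would be consistent with the empty `Differences: .` output above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a schema field (not `arrow::datatypes::Field`).
#[derive(Debug, Clone, PartialEq)]
struct FieldModel {
    name: String,
    data_type: String,
    nullable: bool,
    metadata: HashMap<String, String>,
}

/// Hypothetical diff routine: reports name, type, and nullability
/// mismatches, but (by assumption) never looks at metadata.
fn describe_differences(a: &FieldModel, b: &FieldModel) -> Vec<String> {
    let mut diffs = Vec::new();
    if a.name != b.name {
        diffs.push(format!("name: {} vs {}", a.name, b.name));
    }
    if a.data_type != b.data_type {
        diffs.push(format!("type: {} vs {}", a.data_type, b.data_type));
    }
    if a.nullable != b.nullable {
        diffs.push(format!("nullable: {} vs {}", a.nullable, b.nullable));
    }
    diffs
}

/// Builds a model of the `value` column with the given metadata.
fn value_field(metadata: HashMap<String, String>) -> FieldModel {
    FieldModel {
        name: String::from("value"),
        data_type: String::from("Boolean"),
        nullable: false,
        metadata,
    }
}

fn main() {
    let mut metadata = HashMap::new();
    metadata.insert(String::from("year"), String::from("2015"));

    let with_metadata = value_field(metadata);
    let without_metadata = value_field(HashMap::new());

    // Derived equality includes the metadata map, so the fields are unequal...
    println!("equal: {}", with_metadata == without_metadata);
    // ...but the diff routine has nothing to report.
    let diffs = describe_differences(&with_metadata, &without_metadata);
    println!("Differences: {}.", diffs.join(", "));
}
```

If the real check behaves like this model, including metadata in the difference report would make the error message actionable.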
### Expected behavior
I expect the query to pass without error
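
For reference, the value the query should produce on the two rows written by the generator above can be worked out by hand. The sketch below models the `LAG` / `CASE` / `COUNT` logic in plain Rust; `count_true_rises` is a hypothetical helper written for this example, and it assumes the input is already sorted by `time` and that `value` is non-nullable (as in the generated schema).

```rust
/// Counts rows whose previous value is `false` and whose own value is `true`.
/// The first row has no previous row (LAG yields NULL there), so it is
/// never counted; `windows(2)` skips it naturally.
fn count_true_rises(values_by_time: &[bool]) -> usize {
    values_by_time
        .windows(2)
        .filter(|pair| !pair[0] && pair[1])
        .count()
}

fn main() {
    // The two `value` entries from repro.parquet, in time order.
    let values = [true, false];
    println!("count_true_rises = {}", count_true_rises(&values));
}
```

With the rows `true, false` there is no false-to-true transition, so under these assumptions the expected result is a single row with `name = 'foo'` and `count_true_rises = 0`.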
### Additional context
_No response_