wiedld opened a new issue, #11770: URL: https://github.com/apache/datafusion/issues/11770
### Describe the bug

We have been using two parquet writers: ArrowWriter and ParquetSink (parallelized writes). We discovered a bug: the ArrowWriter [includes the arrow schema (by default) in the parquet metadata](https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/arrow_writer/mod.rs#L188-L190) on write, whereas datafusion's ParquetSink does not include the arrow schema in the file metadata (i.e., it is [missing here](https://github.com/apache/datafusion/blob/ae2ca6a0e21b77bba1ac40ea6ee059e47d0791e0/datafusion/core/src/datasource/file_format/parquet.rs#L1064-L1068)).

The missing arrow schema metadata is important, as its inclusion [aids with later reading](https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/schema/mod.rs#L77-L79).

### To Reproduce

1. Write parquet with ParquetSink.
2. Write parquet with ArrowWriter (default options).
3. Attempt to read the arrow schema from the parquet metadata, using the below/linked APIs:

       let file_metadata: FileMetaData = <[get from file per API](https://github.com/apache/arrow-rs/blob/bf1a9ec7faa1e271681317572098c4d83297c3a9/parquet/src/file/metadata/mod.rs#L143)>;
       let arrow_schema = [parquet_to_arrow_schema](https://github.com/apache/arrow-rs/blob/bf1a9ec7faa1e271681317572098c4d83297c3a9/parquet/src/arrow/schema/mod.rs#L53)(
           file_metadata.schema_descr(),
           file_metadata.key_value_metadata(),
       );

4. An error is returned for parquet written by ParquetSink.

### Expected behavior

Parquet written by ParquetSink should have the same default behavior (to include the arrow schema in the parquet metadata) as the ArrowWriter.

### Additional context

_No response_
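To make the difference between the two writers concrete, here is a minimal, std-only sketch of the footer lookup that the reproduction steps rely on. It is a toy model, not the parquet crate API: the `KeyValue` struct is a local stand-in that mirrors the shape of the crate's key-value metadata entries (a `key` plus an optional `value`), and `embedded_arrow_schema` is a hypothetical helper invented for this illustration. The one detail taken from arrow-rs is the metadata key `ARROW:schema`, under which the encoded arrow schema is stored; the payload is treated here as an opaque string and is not decoded.

```rust
/// Footer key under which arrow-rs stores the encoded arrow schema.
const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";

/// Local stand-in for one key-value metadata entry in a parquet footer.
/// (Assumption: modeled after the parquet crate's entry type, not imported from it.)
#[derive(Debug, Clone, PartialEq)]
struct KeyValue {
    key: String,
    value: Option<String>,
}

/// Hypothetical helper: return the still-encoded arrow schema payload if the
/// footer metadata carries one, or `None` if the writer did not embed it.
fn embedded_arrow_schema(kv: Option<&[KeyValue]>) -> Option<&str> {
    kv?.iter()
        // Find the entry whose key matches the arrow schema key.
        .find(|entry| entry.key == ARROW_SCHEMA_META_KEY)
        // An entry with the key but no value also counts as "not embedded".
        .and_then(|entry| entry.value.as_deref())
}
```

With this model, a footer that contains an `ARROW:schema` entry (as written by ArrowWriter with default options) yields `Some(payload)`, while a footer without that entry (the ParquetSink case described above) yields `None`. Running the same check against the key-value metadata of both real files is a quick way to confirm which writer produced a given file.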
