wiedld opened a new issue, #11770:
URL: https://github.com/apache/datafusion/issues/11770

   ### Describe the bug
   
   We have been using two parquet writers: ArrowWriter and ParquetSink 
(parallelized writes). We discovered a discrepancy between them. By default, the ArrowWriter 
[includes the arrow schema in the parquet 
metadata](https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/arrow_writer/mod.rs#L188-L190)
 on write, whereas datafusion's ParquetSink does not include the arrow schema 
in the file metadata (i.e. it's [missing 
here](https://github.com/apache/datafusion/blob/ae2ca6a0e21b77bba1ac40ea6ee059e47d0791e0/datafusion/core/src/datasource/file_format/parquet.rs#L1064-L1068)).
 The arrow schema metadata is important, as its inclusion [aids with 
later 
reading](https://github.com/apache/arrow-rs/blob/2905ce6796cad396241fc50164970dbf1237440a/parquet/src/arrow/schema/mod.rs#L77-L79).
   
   
   
   
   ### To Reproduce
   
   1. Write parquet with ParquetSink.
   2. Write parquet with ArrowWriter (default options).
   3. Attempt to read the arrow schema from the parquet metadata, using the 
APIs linked below:
   
   
   
   let file_metadata: FileMetaData = <[get from file per 
API](https://github.com/apache/arrow-rs/blob/bf1a9ec7faa1e271681317572098c4d83297c3a9/parquet/src/file/metadata/mod.rs#L143)>;
   
   let arrow_schema = 
[parquet_to_arrow_schema](https://github.com/apache/arrow-rs/blob/bf1a9ec7faa1e271681317572098c4d83297c3a9/parquet/src/arrow/schema/mod.rs#L53)(
         file_metadata.schema_descr(),
         file_metadata.key_value_metadata(),
   );
   
   4. An error is returned for parquet written by ParquetSink.
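
A quicker way to tell the two files apart is to look for the embedded schema directly in the footer key-value metadata. As far as I can tell, arrow-rs stores the encoded arrow schema under the key `ARROW:schema`. The sketch below models that check without depending on the parquet crate: `KeyValue` is a local stand-in that assumes the same shape as the crate's type (a key plus an optional value), and `has_arrow_schema` is a hypothetical helper name, not an existing API.

```rust
/// Local stand-in for the parquet crate's `KeyValue`
/// (assumed shape: a string key plus an optional string value).
#[derive(Debug, Clone)]
struct KeyValue {
    key: String,
    value: Option<String>,
}

/// Key under which arrow-rs stores the encoded arrow schema in the parquet footer.
const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";

/// Hypothetical helper: returns true if the footer key-value metadata
/// carries a non-empty embedded arrow schema entry.
fn has_arrow_schema(key_value_metadata: Option<&Vec<KeyValue>>) -> bool {
    key_value_metadata
        .map(|kvs| {
            kvs.iter().any(|kv| {
                kv.key == ARROW_SCHEMA_META_KEY
                    && kv.value.as_deref().map_or(false, |v| !v.is_empty())
            })
        })
        .unwrap_or(false)
}
```

With the real crate, the argument would be the result of `file_metadata.key_value_metadata()` from step 3. The expectation is `true` for the file written by ArrowWriter with default options and `false` for the file written by ParquetSink.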
   
   ### Expected behavior
   
   ParquetSink should have the same default behavior as the ArrowWriter: 
include the arrow schema in the parquet metadata on write.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

