msalib opened a new issue, #4023:
URL: https://github.com/apache/arrow-rs/issues/4023

   **Describe the bug**
   
   Let's say you're trying to read a Parquet file on S3 asynchronously, and that file 
has schema metadata (like "created by"). There's an inconsistency:
   
   `ParquetRecordBatchStream::schema` will produce a `Schema` object that 
includes that metadata, but `ParquetRecordBatchStream` will yield 
`RecordBatch`es whose schemas don't include it.
   
   The problem is that if you create an `ArrowWriter` using the first schema 
and then try to write the stream's batches to it, the schemas won't match: 
the writer expects the metadata, but each batch's schema lacks it.
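
   A minimal reproduction sketch, assuming the `parquet` crate with its `async` 
feature plus `tokio` and `futures`; a local file read via `tokio::fs::File` 
stands in for the S3 object here, and `data_with_metadata.parquet` is a 
hypothetical file whose Arrow schema carries key/value metadata:

   ```rust
   use futures::TryStreamExt;
   use parquet::arrow::ParquetRecordBatchStreamBuilder;
   use tokio::fs::File;

   #[tokio::main]
   async fn main() -> Result<(), Box<dyn std::error::Error>> {
       // Hypothetical file standing in for the S3 object; assumed to carry
       // key/value metadata (e.g. "created by") in its Arrow schema.
       let file = File::open("data_with_metadata.parquet").await?;
       let builder = ParquetRecordBatchStreamBuilder::new(file).await?;
       let mut stream = builder.build()?;

       // Schema reported by the stream itself: includes the metadata.
       let stream_schema = stream.schema().clone();
       println!("stream schema metadata: {:?}", stream_schema.metadata());

       // Schema attached to the first yielded batch: the metadata is missing.
       if let Some(batch) = stream.try_next().await? {
           println!("batch schema metadata:  {:?}", batch.schema().metadata());
           // Fails today: only the stream schema carries the metadata.
           assert_eq!(stream_schema, batch.schema());
       }
       Ok(())
   }
   ```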
   
   **Expected behavior**
   
   I'd expect that either:
   * `ParquetRecordBatchStream::schema` produces a `Schema` without metadata, or
   * the `RecordBatch`es produced by `ParquetRecordBatchStream` have the exact 
same schema as what `::schema` returns, or
   * `ArrowWriter` tolerates its supplied schema differing from the batch 
schemas passed to `write()` only in metadata
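
   Until one of those lands, a possible workaround (my own sketch, not 
something from the library docs) is to strip the top-level key/value metadata 
from the stream's schema before constructing the `ArrowWriter`, so the 
writer's schema matches what the batches actually carry:

   ```rust
   use std::sync::Arc;

   use arrow::datatypes::Schema;
   use parquet::arrow::ArrowWriter;

   /// Build a writer whose schema has the key/value metadata removed, so it
   /// matches the metadata-less schemas on the batches yielded by the stream.
   fn writer_without_metadata(
       stream_schema: &Schema,
       sink: std::fs::File,
   ) -> parquet::errors::Result<ArrowWriter<std::fs::File>> {
       // Keep the fields, drop the schema-level metadata map.
       let stripped = Arc::new(Schema::new(stream_schema.fields().clone()));
       ArrowWriter::try_new(sink, stripped, None)
   }
   ```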
   

