nevi-me commented on issue #252:
URL: https://github.com/apache/arrow-rs/issues/252#issuecomment-836925868


   > Perhaps we could make the simplifying assumption and say "the arrow schema is supposed to be the same for all records, and thus we assume the metadata that applies to all the rows should be the same as well"?
   
   If I think about how the IPC format works: we send the schema first, and then send the batches after it. The batches don't carry a copy of the schema; they contain only the buffers that make up the data.
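
   To make that concrete, here's a minimal sketch of my own (not code from the issue) showing that behaviour with arrow-rs's `StreamWriter`: the schema message is written once when the writer is created, and each subsequent `write` emits only the batch's buffers. The field name and data are made up for illustration.

```rust
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::Result;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

fn write_ipc_stream() -> Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));

    // Writing to an in-memory buffer, just for illustration.
    let mut buffer: Vec<u8> = Vec::new();

    // The schema message is written to the stream here, once, up front.
    let mut writer = StreamWriter::try_new(&mut buffer, &schema)?;

    // Each batch message that follows carries only the data buffers;
    // the schema is not repeated per batch.
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;
    writer.write(&batch)?;
    writer.finish()?;
    Ok(())
}
```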
   
   So, my thinking is that we:
   
   * Write the schema of the Arrow data to `FileMetaData`
   * Write the schema of each field to `ColumnMetaData`
   * Use the schema that's provided when the writer is created, not the schema of each `RecordBatch` passed to `ArrowWriter::write(batch: &RecordBatch)` (see the sketch after this list).
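
   From the caller's side, that would look roughly like the sketch below (my own illustration, not code from this repo): the Arrow schema handed to `ArrowWriter::try_new` is the one that would be embedded in `FileMetaData` (with per-field details going to `ColumnMetaData`), and batches written afterwards are expected to conform to it.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Int32Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn write_parquet() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    // Illustrative output path.
    let file = File::create("example.parquet")?;

    // The schema passed here is the one that would be serialized into the
    // file-level metadata; per-batch schemas would not override it.
    let mut writer = ArrowWriter::try_new(file, schema.clone(), None)?;

    let batch = RecordBatch::try_new(
        schema,
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```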
   
   I can't think of a valid use-case where we'd expect a stream of Arrow data's metadata (at either the schema or field level) to change mid-stream. I don't think we'd even be able to communicate such a scenario with `arrow-flight`.
   
   I do wonder, though, whether Parquet ordinarily handles a scenario where the metadata differs per file 🤔

