westonpace commented on issue #11186: URL: https://github.com/apache/arrow/issues/11186#issuecomment-923208251
No, not really, and certainly not today at the pyarrow level. You would need to keep your data and metadata in separate files, or otherwise introduce some kind of padding between them. Changing a column name, or metadata in general, is likely to change the size of the metadata block. The current reader/writer packs data and metadata together, so rewriting the metadata in place would clobber existing data.

In theory, though, it might be possible. I'm fairly sure (but far from certain) that the format allows for it. Some parts of Arrow's C++ parquet library could potentially be reused, but I think you would need to do quite a bit of novel development to get there.

That being said, a potentially easier approach, which would handle both column names and metadata, is to store an authoritative schema as a standalone, metadata-only file (parquet or Arrow IPC). Then, after reading in your data, you could create a table using your authoritative schema and the column data from the file(s) you read into memory.

If you expand that concept much further, you start to get into the "schema evolution" and "metadata storage" concepts found in something like Iceberg, so you may want to look at that project as well.
