[GitHub] [arrow] westonpace commented on issue #11186: Update column names or meta data "in-place" without reading in file contents

GitBox Mon, 20 Sep 2021 12:17:53 -0700


westonpace commented on issue #11186:
URL: https://github.com/apache/arrow/issues/11186#issuecomment-923208251



   No, not really, certainly not today at the pyarrow level.  You would need to 
have your data and metadata in separate files or otherwise introduce some kind 
of padding in between them.  Changing a column name, or metadata in general, is 
likely to change the size of the metadata block.  The current reader/writer 
packs data and metadata together, so rewriting a metadata block of a different 
size in place would clobber existing data.
   
   In theory, though, it might be possible.  I believe (but have not verified) 
that the format allows for it.  Some parts of Arrow's C++ parquet library could 
potentially be reused, but I think you would need to do quite a bit of novel 
development to get this working.
   
   That being said, a potentially easier approach, which would handle column 
names and metadata, is to simply store an authoritative schema as a 
metadata-only standalone file (parquet or Arrow IPC).  Then, after reading in 
your data, you could create a table using your authoritative schema and the 
column data from the file(s) you read into memory.
   
   If you expand that concept much further you start to get into "schema 
evolution" and "metadata storage" concepts in something like Iceberg and so you 
may want to look at that project as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
