raduteo commented on pull request #164:
URL: https://github.com/apache/parquet-format/pull/164#issuecomment-771239827


   @gszadovszky and @emkornfield it's quite a coincidence that I was just
looking into cleaning up apache/arrow#8130 when I noticed this thread.
   External column chunk support is one of the key features that attracted me
to Parquet in the first place, and I would like the chance to lobby for keeping
it and actually expanding its adoption - I already have the complete PR
mentioned above and I can help with supporting it across other implementations.
   There are a few major domains where I see this as a valuable component:
   1. Allowing concurrent reads of fully flushed row groups while the Parquet
file is still being appended to. A slight variant of this is allowing subsequent
row group appends to a Parquet file without impacting potential readers.
   2. Being able to aggregate multiple data sets in a master Parquet file. One
scenario is cumulative recordings, such as stock prices that are collected daily
and need to be presented as one unified historical file; another is enrichment,
where we want to add new columns to an existing data set.
   3. Allowing for bi-temporal changes to a Parquet file: external column
chunks allow one to apply small corrections by creating delta files and new
footers that swap out the chunks that require changes and point to the new
ones.
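
   To make the third use case concrete, here is a minimal sketch of the
footer-swap idea. It does not use any real Parquet library: `ChunkRef`,
`RowGroup`, `Footer` and `swap_chunk` are hypothetical names for illustration
only, and the file name and offsets are made up. The only thing borrowed from
the format is the notion of an optional `file_path` on a column chunk, where an
absent value means the chunk lives in the same file as the footer.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ChunkRef:
    """Hypothetical stand-in for the column chunk entry in a footer."""
    column: str
    file_path: Optional[str]  # None: chunk is stored in the footer's own file
    file_offset: int


@dataclass(frozen=True)
class RowGroup:
    chunks: Tuple[ChunkRef, ...]


@dataclass(frozen=True)
class Footer:
    row_groups: Tuple[RowGroup, ...]


def swap_chunk(footer, rg_index, column, new_path, new_offset):
    """Return a new footer in which one chunk points into a delta file.

    The original footer is left untouched, so existing readers keep working.
    """
    old_rg = footer.row_groups[rg_index]
    new_chunks = tuple(
        replace(c, file_path=new_path, file_offset=new_offset)
        if c.column == column
        else c
        for c in old_rg.chunks
    )
    new_groups = (
        footer.row_groups[:rg_index]
        + (RowGroup(new_chunks),)
        + footer.row_groups[rg_index + 1:]
    )
    return Footer(new_groups)


# Two row groups, all chunks stored locally (file_path is None).
original = Footer((
    RowGroup((ChunkRef("price", None, 4), ChunkRef("volume", None, 1024))),
    RowGroup((ChunkRef("price", None, 2048), ChunkRef("volume", None, 3072))),
))

# Correct the "price" chunk of the second row group via a delta file.
corrected = swap_chunk(original, 1, "price", "delta-0001.parquet", 4)
```

   Only the corrected chunk is rewritten; every other chunk reference is shared
between the old and the new footer, which is why the cost of a small
correction stays small.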
   
   If the above use cases are addressed by other Parquet overlays, or they
don't line up with the intended usage of Parquet, I can look elsewhere, but it
seems like a huge opportunity, and the development cost of supporting it is
quite minor by comparison.



