jorgecarleitao commented on issue #746: URL: https://github.com/apache/arrow-rs/issues/746#issuecomment-914934758
I am not very experienced in streaming logs to parquet, but here is my take:

AFAIK Parquet files are not well suited to append operations, since the footer must describe every row group in the file, so a file can only be finalized once. The usual "append mode" (e.g. in Spark) is to create a new file per write, which is also consistent with the data-lake paradigm of immutable files; Delta Lake addresses some of these challenges on top of that.

In Parquet, each file has its own schema. So, IMO the direction here is to write a new file per schema and handle schema evolution / merging on read.

For buffering, I would keep the log entries in a `Vec<String>`, serialize them to Arrow every X entries, and then write that batch out to a new parquet file (sketch below).

I would probably also write to [email protected] for pointers.
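A minimal sketch of that buffering loop, using the `arrow` and `parquet` crates. The single-column schema, the column name `line`, the flush threshold, the file-naming scheme, and the `incoming_log_entries` source are all hypothetical, filled in for illustration only:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::StringArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

/// Hypothetical flush threshold (the "X entries" above).
const FLUSH_EVERY: usize = 1024;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumption: a single non-nullable utf8 column holding the raw log line.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "line",
        DataType::Utf8,
        false,
    )]));

    let mut buffer: Vec<String> = Vec::with_capacity(FLUSH_EVERY);
    let mut file_counter = 0usize;

    for entry in incoming_log_entries() {
        buffer.push(entry);
        if buffer.len() >= FLUSH_EVERY {
            flush(&schema, &mut buffer, file_counter)?;
            file_counter += 1;
        }
    }
    // Flush whatever is left at shutdown.
    if !buffer.is_empty() {
        flush(&schema, &mut buffer, file_counter)?;
    }
    Ok(())
}

/// Serialize the buffered entries to an Arrow `RecordBatch` and write it
/// as a brand-new parquet file; files are immutable and never appended to.
fn flush(
    schema: &Arc<Schema>,
    buffer: &mut Vec<String>,
    n: usize,
) -> Result<(), Box<dyn std::error::Error>> {
    let array = StringArray::from_iter_values(buffer.drain(..));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)])?;

    let file = File::create(format!("logs-{:05}.parquet", n))?;
    let mut writer = ArrowWriter::try_new(file, schema.clone(), None)?;
    writer.write(&batch)?;
    // Closing the writer writes the footer and finalizes the file.
    writer.close()?;
    Ok(())
}

/// Hypothetical stand-in for a real log source.
fn incoming_log_entries() -> impl Iterator<Item = String> {
    (0..5_000).map(|i| format!("log entry {}", i))
}
```

One file per flush keeps every parquet file immutable; schema evolution then becomes a read-side concern (merge the schemas of the files you select), which matches the data-lake approach described above.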
