jorgecarleitao commented on issue #746: URL: https://github.com/apache/arrow-rs/issues/746#issuecomment-914934758
I am not very experienced in streaming logs to parquet, but here is my take:

AFAIK Parquet files are not well suited to append operations, since the footer must describe every row group in the file, so a file can only be finalized once. The usual "append mode" (e.g. in Spark) is to create a new file per write, which is also consistent with the data-lake paradigm of immutable files; Delta Lake addresses some of these challenges on top of that.

In Parquet, each file has its own schema. So, IMO the direction here is to write a new file per schema and handle schema evolution / merging on read.

For buffering, I would keep the log entries in a `Vec<String>`, serialize them to Arrow every X entries, and then write that batch out to a new parquet file (sketch below).

I would probably also write to [email protected] for pointers.
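A minimal sketch of that buffering loop, using the `arrow` and `parquet` crates. The single-column schema, the column name `line`, the flush threshold, the file-naming scheme, and the `incoming_log_entries` source are all hypothetical, filled in for illustration only:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::StringArray;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

/// Hypothetical flush threshold (the "X entries" above).
const FLUSH_EVERY: usize = 1024;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumption: a single non-nullable utf8 column holding the raw log line.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "line",
        DataType::Utf8,
        false,
    )]));

    let mut buffer: Vec<String> = Vec::with_capacity(FLUSH_EVERY);
    let mut file_counter = 0usize;

    for entry in incoming_log_entries() {
        buffer.push(entry);
        if buffer.len() >= FLUSH_EVERY {
            flush(&schema, &mut buffer, file_counter)?;
            file_counter += 1;
        }
    }
    // Flush whatever is left at shutdown.
    if !buffer.is_empty() {
        flush(&schema, &mut buffer, file_counter)?;
    }
    Ok(())
}

/// Serialize the buffered entries to an Arrow `RecordBatch` and write it
/// as a brand-new parquet file; files are immutable and never appended to.
fn flush(
    schema: &Arc<Schema>,
    buffer: &mut Vec<String>,
    n: usize,
) -> Result<(), Box<dyn std::error::Error>> {
    let array = StringArray::from_iter_values(buffer.drain(..));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(array)])?;

    let file = File::create(format!("logs-{:05}.parquet", n))?;
    let mut writer = ArrowWriter::try_new(file, schema.clone(), None)?;
    writer.write(&batch)?;
    // Closing the writer writes the footer and finalizes the file.
    writer.close()?;
    Ok(())
}

/// Hypothetical stand-in for a real log source.
fn incoming_log_entries() -> impl Iterator<Item = String> {
    (0..5_000).map(|i| format!("log entry {}", i))
}
```

One file per flush keeps every parquet file immutable; schema evolution then becomes a read-side concern (merge the schemas of the files you select), which matches the data-lake approach described above.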
