alippai commented on issue #35638: URL: https://github.com/apache/arrow/issues/35638#issuecomment-1553346253
@westonpace reading the [parquet thrift doc](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift), the naive approach would be to keep only the buffers and statistics and recreate everything else. I didn't know parquet works like this, thanks for the insight! My goal is slightly different from deltalake and others (and I'm also not a fan of JVM-based setups for this kind of workload). My idea was to rely less on the traditional FS and more on the internal structure of parquet, for the very reason you've mentioned (filters, statistics). Architecturally, Skyhook would be closer to this, or "simply" storing all the metadata + statistics in TiKV or another KV store.
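
To make the "metadata + statistics in a KV store" idea a bit more concrete, here is a rough sketch of how row-group pruning could look. Everything in it is an assumption for illustration: a plain `dict` stands in for TiKV, and `ChunkStats`, `buffer_ref`, and `candidate_row_groups` are hypothetical names, not Arrow, Parquet, or Skyhook APIs. A real KV store would also need the tuple keys serialized to bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkStats:
    """Min/max statistics for one column chunk (hypothetical record layout)."""
    min_value: int
    max_value: int
    buffer_ref: str  # hypothetical pointer to where the raw column buffer is stored


def candidate_row_groups(kv, file_id, column, lo, hi):
    """Return sorted row-group indices whose [min, max] range overlaps [lo, hi].

    `kv` maps (file_id, row_group, column) -> ChunkStats; the dict is only a
    stand-in for a real key-value store.
    """
    hits = []
    for (fid, row_group, col), stats in kv.items():
        if fid != file_id or col != column:
            continue
        # Two closed ranges overlap when each one starts before the other ends.
        if stats.min_value <= hi and stats.max_value >= lo:
            hits.append(row_group)
    return sorted(hits)


# Made-up example data: three row groups of a hypothetical file.
kv = {
    ("events.parquet", 0, "ts"): ChunkStats(0, 99, "blob-0"),
    ("events.parquet", 1, "ts"): ChunkStats(100, 199, "blob-1"),
    ("events.parquet", 2, "ts"): ChunkStats(200, 299, "blob-2"),
}

# Only row groups that can contain ts values in [150, 250] need to be fetched.
print(candidate_row_groups(kv, "events.parquet", "ts", 150, 250))
```

Only the buffers referenced by the surviving row groups would then have to be read, which is the part where relying on the statistics instead of the file layout pays off.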
