alippai commented on issue #35638:
URL: https://github.com/apache/arrow/issues/35638#issuecomment-1553346253

   @westonpace reading the [parquet thrift 
doc](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift),
the naive approach would be to keep only the buffers and statistics and 
recreate everything else. I didn't know Parquet works like this, thanks for 
the insight!
   
   My goal is slightly different from deltalake and others (and I'm also not 
a fan of JVM-based setups for this kind of workload). My idea was to rely less 
on the traditional FS and more on the internal structure of Parquet, for the 
very reason you mentioned (filters, statistics). Architecturally, Skyhook 
would be closer to this, or "simply" storing all the metadata + statistics in 
TiKV or another KV store.
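
   To make the KV-store idea concrete, here is a rough sketch of how per-row-group 
min/max statistics could drive filter pruning without touching the files 
themselves. Everything in it is an assumption for illustration: the key layout 
`(file, row group, column)`, the helper name `row_groups_to_read`, and the plain 
`dict` standing in for TiKV. It does not use any Arrow or Parquet API.

```python
from typing import Dict, List, Tuple

# Hypothetical key layout: (file path, row group index, column name).
StatKey = Tuple[str, int, str]
# Value: (min, max) of the column chunk, mirroring the statistics kept in a Parquet footer.
MinMax = Tuple[int, int]


def row_groups_to_read(
    kv: Dict[StatKey, MinMax], column: str, lo: int, hi: int
) -> List[Tuple[str, int]]:
    """Return (file, row_group) pairs whose [min, max] range overlaps [lo, hi]."""
    hits = []
    for (path, rg, col), (cmin, cmax) in sorted(kv.items()):
        # A row group can be skipped when its value range cannot match the filter.
        if col == column and cmax >= lo and cmin <= hi:
            hits.append((path, rg))
    return hits


# A plain dict stands in for TiKV or another KV store (assumption).
kv = {
    ("a.parquet", 0, "ts"): (0, 99),
    ("a.parquet", 1, "ts"): (100, 199),
    ("b.parquet", 0, "ts"): (200, 299),
}

# Filter: 150 <= ts <= 250, so only two of the three row groups need to be fetched.
selected = row_groups_to_read(kv, "ts", 150, 250)
```

   With a real KV store the loop would presumably become a prefix or range scan 
over the keys, but the pruning condition stays the same.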


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

