I wanted to share some work Xiangpeng Hao did at InfluxData this summer on the current (Thrift) metadata format [1].
We found that with careful software engineering, we could likely improve the speed of reading the existing Parquet footer format by a factor of 4 or more ([2] contains some specific ideas). While we analyzed the Rust implementation, I believe a similar conclusion applies to C/C++.

I realize that there are certain features that switching to an entirely new footer format would achieve, but the cost of adopting a new format across the ecosystem is immense (e.g. Parquet "version 2.0" etc.).

It is my opinion that investing the same effort in software optimization that would be required for a new footer format would have a much bigger impact.

Andrew

[1]: https://www.influxdata.com/blog/how-good-parquet-wide-tables/
[2]: https://github.com/apache/arrow-rs/issues/5853

On Thu, Aug 15, 2024 at 4:26 AM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:

> Hi Julien.
>
> Thank you for reconnecting the threads.
>
> I have broken down my experiments in a narrative, commit by commit on how
> we can go from flatbuffers being ~2x larger than thrift to being smaller
> (and at times even half) the size of thrift. This is still on an internal
> branch, I will resume work towards the end of this month to port it to
> arrow so that folks can look at the progress and share ideas.
>
> On the benchmarking front I need to build and share a binary for third
> parties to donate their footers for analysis.
>
> The PR for parquet extensions has gotten a few rounds of reviews:
> https://github.com/apache/parquet-format/pull/254. I hope it will be
> merged soon.
>
> I missed the sync yesterday - for some reason I didn't receive an
> invitation. Julien could you add me again to the invite list?
>
> On Thu, Aug 15, 2024 at 1:32 AM Julien Le Dem <jul...@apache.org> wrote:
>
> > This came up in the sync today.
> >
> > There are a few concurrent experiments with flatbuffers for a future
> > Parquet footer replacement. In itself it is fine and just wanted to
> > reconnect the threads here so that folks are aware of each other and can
> > share findings.
> >
> > - Neelaksh benchmarking and experiments:
> >
> > https://medium.com/@neelaksh-singh/benchmarking-apache-parquet-my-mid-program-journey-as-an-mlh-fellow-bc0b8332c3b1
> > https://github.com/Neelaksh-Singh/gresearch_parquet_benchmarking
> >
> > - Alkis has also been experimenting and led the proposal for enabling
> > extending the existing footer.
> >
> > https://docs.google.com/document/d/1KkoR0DjzYnLQXO-d0oRBv2k157IZU0_injqd4eV4WiI/edit#heading=h.15ohoov5qqm6
> >
> > - Xuwei also shared that he is looking into this.
> >
> > I would suggest that you all reply to this thread sharing your current
> > progress or ideas and a link to your respective repos for experimenting.
> >
> > Best
> > Julien