Hi Parquet Dev,

I wanted to start a conversation within the community about working on a new revision of Parquet. For context, a number of new formats [1][2][3] have shown that there is decent room for improvement in data encodings and in how metadata is organized.
Specifically, in a new format revision I think we should be considering the following areas for improvement:

1. More efficient encodings that allow for data skipping and SIMD optimizations.
2. More efficient metadata handling for deserialization and projection, to address cases where metadata deserialization time is non-trivial [4].
3. Possibly different encodings for repeated and nested fields, instead of repetition/definition levels.
4. Support for optimizing semi-structured data (e.g. JSON or the Variant type) by shredding elements into individual columns (a recent thread in Iceberg mentions doing this at the metadata level [5]).

I think the goals of V3 would be to preserve existing API compatibility as broadly as possible (possibly with some performance loss) and to expose new API surface area where appropriate to make use of the new elements. New encodings could be backported so they can be used without metadata changes. Unfortunately, I think points 2 and 3 would require breaking file-level compatibility. More thought would be needed on whether point 4 could be backported effectively.

This is a non-trivial amount of work to get good coverage across implementations, so before putting together a more formal proposal it would be nice to know:

1. Is there an appetite in the general community to consider these changes?
2. Is anybody from the community interested in collaborating on proposals/implementations in this area?

Thanks,
Micah

[1] https://github.com/maxi-k/btrblocks
[2] https://github.com/facebookincubator/nimble
[3] https://blog.lancedb.com/lance-v2/
[4] https://github.com/apache/arrow/issues/39676
[5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
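P.S. For anyone less familiar with the repetition/definition levels mentioned in point 3: they are Parquet's Dremel-derived way of flattening nested data into flat columns. Below is a minimal Python sketch of the idea, deliberately simplified to a single non-null list<int> column, so the level values are illustrative and not Parquet's exact level assignment:

```python
# Simplified sketch of Dremel-style repetition/definition levels for one
# list<int> column (lists non-null, elements non-null). Illustration only;
# real Parquet derives max levels from the full nested schema.

def encode(lists):
    """Flatten a column of lists into (rep, def, value) triples.

    rep=0 starts a new record, rep=1 continues the current list;
    def=1 means a value is present, def=0 marks an empty list.
    """
    out = []
    for lst in lists:
        if not lst:
            out.append((0, 0, None))  # empty list: record boundary, no value
        else:
            for i, v in enumerate(lst):
                out.append((0 if i == 0 else 1, 1, v))
    return out

def decode(triples):
    """Rebuild the column of lists from the flat triples."""
    lists = []
    for rep, dl, v in triples:
        if rep == 0:
            lists.append([])     # rep=0: begin a new record
        if dl == 1:
            lists[-1].append(v)  # def=1: a concrete value exists
    return lists

column = [[1, 2], [3], [], [4, 5, 6]]
triples = encode(column)
assert decode(triples) == column
```

The point of contention in 3 is that while this flattening composes well, reconstructing records from the levels is branchy and hard to vectorize, which is part of why newer formats explore alternatives.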