Hi all,

I've discussed this with my colleagues and we can dedicate two engineers for 4-6 months to tasks related to implementing the format changes. We're already active in the design discussions and can help with the C++, Rust, and C# implementations. I thought it would be good to state this explicitly, FWIW.

Our main areas of interest are efficient reads for tables with wide schemas and faster random row-group access [1]. To work around the wide-schema issue we implemented an internal tool, PalletJack [2], which stores index information in a separate file so that a reader can deserialize only the necessary subset of the metadata. We would offer this for consideration as one possible approach to the wide-schema problem.
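To make the idea concrete, here is a minimal runnable sketch using plain pyarrow. It only demonstrates the cost being worked around; the sidecar index format itself is PalletJack-specific and is described in the comments rather than implemented. The `metadata=` argument on ParquetFile mentioned in the comment is the standard pyarrow hook through which a pre-parsed metadata subset can be injected.

    import time
    import pyarrow as pa
    import pyarrow.parquet as pq

    # A wide table: footer size grows with columns x row groups, and
    # pq.read_metadata() always deserializes all of it, even when the
    # query touches only a handful of columns.
    table = pa.table({f"c{i}": [0.0] * 10 for i in range(5_000)})
    pq.write_table(table, "wide.parquet", row_group_size=2)

    t0 = time.perf_counter()
    md = pq.read_metadata("wide.parquet")
    print(f"{md.num_columns} cols x {md.num_row_groups} row groups, "
          f"footer={md.serialized_size:,} bytes, "
          f"parsed in {time.perf_counter() - t0:.3f}s")

    # A sidecar index stores offsets into the serialized footer so a
    # reader can materialize a FileMetaData covering only the row groups
    # and columns it needs, then inject it via the existing hook:
    #   pf = pq.ParquetFile("wide.parquet", metadata=subset_md)
    # skipping the full-footer parse above.

Note that the data pages are untouched; only footer deserialization changes, which keeps the files themselves fully compatible.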
[1] https://github.com/apache/arrow/issues/39676
[2] https://github.com/G-Research/PalletJack

Rok

On Sun, May 12, 2024 at 12:59 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Parquet Dev,
> I wanted to start a conversation within the community about working on a
> new revision of Parquet. For context, there have been a number of new
> formats [1][2][3] that show there is decent room for improvement across
> data encodings and how metadata is organized.
>
> Specifically, in a new format revision I think we should be thinking about
> the following areas for improvement:
> 1. More efficient encodings that allow for data skipping and SIMD
> optimizations.
> 2. More efficient metadata handling for deserialization and projection, to
> address cases where metadata deserialization time is non-trivial [4].
> 3. Possibly considering different encodings for repeated and nested fields,
> instead of repetition/definition levels.
> 4. Support for optimizing semi-structured data (e.g. JSON or the Variant
> type) that can shred elements into individual columns (a recent thread in
> Iceberg mentions doing this at the metadata level [5]).
>
> I think the goals of V3 would be to preserve existing API compatibility as
> broadly as possible (possibly with some performance loss) and to expose new
> API surface area where appropriate to make use of the new elements. New
> encodings could be backported so they can be used without metadata
> changes. Unfortunately, I think that for points 2 and 3 we would need to
> break file-level compatibility. More thought is needed on whether 4 could
> be backported effectively.
>
> This is a non-trivial amount of work to get good coverage across
> implementations, so before putting together a more formal proposal it
> would be nice to know:
>
> 1. Whether there is an appetite in the general community to consider
> these changes.
> 2. Whether anybody from the community is interested in collaborating on
> proposals/implementations in this area.
>
> Thanks,
> Micah
>
> [1] https://github.com/maxi-k/btrblocks
> [2] https://github.com/facebookincubator/nimble
> [3] https://blog.lancedb.com/lance-v2/
> [4] https://github.com/apache/arrow/issues/39676
> [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34