Hi Parquet Dev,
I wanted to start a conversation within the community about working on a
new revision of Parquet.  For context, there have been a number of new
formats [1][2][3] that show there is substantial room for improvement in
both data encodings and how metadata is organized.

Specifically, in a new format revision I think we should be considering
the following areas for improvement:
1.  More efficient encodings that allow for data skipping and SIMD
optimizations.
2.  More efficient metadata handling for deserialization and projection to
address areas when metadata deserialization time is not trivial [4].
3.  Possibly exploring different encodings, instead of
repetition/definition levels, for repeated and nested fields.
4.  Support for optimizing semi-structured data (e.g. JSON or the Variant
type) by shredding elements into individual columns (a recent thread in
Iceberg mentions doing this at the metadata level [5]).

I think the goals of V3 would be to preserve existing API compatibility as
broadly as possible (possibly with some performance loss) and to expose new
API surface areas where appropriate to make use of the new elements.  New
encodings could be backported so they can be used without metadata
changes.  Unfortunately, I think points 2 and 3 would require breaking
file-level compatibility.  More thought would be needed to determine
whether point 4 could be backported effectively.
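
To make point 4 a bit more concrete, here is a hypothetical sketch of
shredding (the function and parameter names are mine, not from any spec):
keys the writer chooses to shred become dedicated typed columns, and any
remaining structure lands in a catch-all residual column, serialized here
as JSON text:

```python
import json

# Hypothetical sketch of shredding semi-structured records: chosen keys
# become dedicated columns (missing keys become nulls); leftover key/value
# pairs stay in a residual column, serialized here as JSON.

def shred(records, shredded_keys):
    columns = {k: [] for k in shredded_keys}   # one typed column per key
    residual = []                              # leftover structure per record
    for rec in records:
        for k in shredded_keys:
            columns[k].append(rec.get(k))      # missing key -> null
        rest = {k: v for k, v in rec.items() if k not in shredded_keys}
        residual.append(json.dumps(rest) if rest else None)
    return columns, residual

docs = [{"id": 1, "name": "a", "tags": ["x"]},
        {"id": 2, "name": "b"},
        {"id": 3, "extra": True}]
cols, rest = shred(docs, ["id", "name"])
print(cols)  # {'id': [1, 2, 3], 'name': ['a', 'b', None]}
print(rest)  # ['{"tags": ["x"]}', None, '{"extra": true}']
```

The shredded columns then get all of Parquet's normal per-column
statistics and encodings, which is what makes predicate pushdown on
semi-structured data attractive.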

This is a non-trivial amount of work to get good coverage across
implementations, so before putting together a more formal proposal it
would be nice to know:

1.  Whether there is an appetite in the general community to consider
these changes.
2.  Whether anybody from the community is interested in collaborating on
proposals/implementation in this area.

Thanks,
Micah

[1] https://github.com/maxi-k/btrblocks
[2] https://github.com/facebookincubator/nimble
[3] https://blog.lancedb.com/lance-v2/
[4] https://github.com/apache/arrow/issues/39676
[5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34