Hi all,

I've discussed this with my colleagues and we can dedicate two engineers
for 4-6 months to tasks related to implementing the format changes. We're
already active in the design discussions and can help with the C++, Rust
and C# implementations. FWIW, I thought it'd be good to state this
explicitly.

Our main areas of interest are efficient reads for tables with wide schemas
and faster random row group access [1].
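
For concreteness, here is roughly what a targeted read looks like today (a
pyarrow sketch; file and column names are made up). The pain point is that
even this small read must first decode the entire Thrift footer, whose size
grows with the number of columns and row groups:

    import pyarrow.parquet as pq

    # Read a single row group and two columns out of a large table.
    # The full footer is still parsed before any data is touched.
    pf = pq.ParquetFile("data.parquet")
    table = pf.read_row_group(7, columns=["id", "value"])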

To work around the wide-schema issue we implemented an internal tool [2]
that stores index information in a separate file, which allows reading
only the necessary subset of the metadata. We'd offer this for
consideration as one possible way to solve the wide-schema problem.
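
To sketch the idea (using plain pyarrow rather than PalletJack's actual
API): the footer is parsed once and the resulting metadata object is reused,
so subsequent reads skip footer decoding. PalletJack additionally keeps
index information in a separate file, so even the initial parse can be
limited to the needed subset:

    import pyarrow.parquet as pq

    # One-time full parse of the footer; PalletJack would instead load
    # only the relevant slices via its separate index file.
    md = pq.read_metadata("wide_table.parquet")

    # Reopen with pre-parsed metadata: no footer re-decoding.
    pf = pq.ParquetFile("wide_table.parquet", metadata=md)
    table = pf.read_row_group(0, columns=["col_17"])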

[1] https://github.com/apache/arrow/issues/39676
[2] https://github.com/G-Research/PalletJack

Rok

On Sun, May 12, 2024 at 12:59 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi Parquet Dev,
> I wanted to start a conversation within the community about working on a
> new revision of Parquet.  For context there have been a bunch of new
> formats [1][2][3] that show there is decent room for improvement across
> data encodings and how metadata is organized.
>
> Specifically, in a new format revision I think we should be thinking about
> the following areas for improvement:
> 1.  More efficient encodings that allow for data skipping and SIMD
> optimizations.
> 2.  More efficient metadata handling for deserialization and projection, to
> address cases where metadata deserialization time is non-trivial [4].
> 3.  Possibly different encodings to replace repetition/definition levels
> for repeated and nested fields.
> 4.  Support for optimizing semi-structured data (e.g. JSON or Variant type)
> by shredding elements into individual columns (a recent thread in Iceberg
> mentions doing this at the metadata level [5]).
>
> I think the goals of V3 would be to preserve existing API compatibility as
> broadly as possible (possibly with some performance loss) and to expose new
> API surface area where appropriate to make use of the new elements.  New
> encodings could be backported so they can be used without metadata
> changes.  Unfortunately, I think that for points 2 and 3 we would need to
> break file-level compatibility.  More thought is needed on whether 4 could
> be backported effectively.
>
> This is a non-trivial amount of work to get good coverage across
> implementations, so before putting together a more formal proposal it would
> be nice to know:
>
> 1.  Whether there is an appetite in the general community to consider these
> changes.
> 2.  Whether anybody in the community is interested in collaborating on
> proposals/implementation in this area.
>
> Thanks,
> Micah
>
> [1] https://github.com/maxi-k/btrblocks
> [2] https://github.com/facebookincubator/nimble
> [3] https://blog.lancedb.com/lance-v2/
> [4] https://github.com/apache/arrow/issues/39676
> [5] https://lists.apache.org/thread/xnyo1k66dxh0ffpg7j9f04xgos0kwc34
>
