[DISCUSS] Parquet metadata evolution proposal

Alkis Evlogimenos Wed, 29 May 2024 12:36:55 -0700

Hi folks.

It is great to see the community moving forward with changes to parquet
metadata to make parquet work better in general and in particular with
wider schemata.


I have been looking at the current proposals:
- https://github.com/apache/parquet-format/pull/242
- https://github.com/apache/parquet-format/pull/248
- https://github.com/apache/parquet-format/pull/250

and took the consolidated feedback across all of them to put together yet
another one. Here's the design sketch:
<https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit>

What's different in this proposal is splitting the work into 3 tracks:
T1. what we can do immediately in the current metadata data structures
T2. what we can do short term in the current metadata data structures
T3. provide safe and backwards-compatible room for experimentation for all
metadata (including every Thrift struct, even outside of FileMetaData) so
that engines can iterate and propose the best format going forward for
Parquet

T3 is important if we strongly believe that we can get the best design
through testing prototypes on real data and measuring the effects rather
than designing changes in PRs. Along the same lines, I am requesting that
you ask through your contacts/customers (I will do the same) for scrubbed
footers of particular interest (wide, deep, etc.) so that we can build a set
of real footers on which we can run benchmarks and drive design decisions.

I am also putting normative PRs out for T1, T2, T3.

Looking forward to your comments.
