Hi folks. It is great to see the community moving forward with changes to parquet metadata to make parquet work better in general and in particular with wider schemata.
I have been looking at the current proposals:

- https://github.com/apache/parquet-format/pull/242
- https://github.com/apache/parquet-format/pull/248
- https://github.com/apache/parquet-format/pull/250

I took the consolidated feedback across all of them and put together yet another one. Here's the design sketch <https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit> .

What's different in this proposal is splitting the work into 3 tracks:

T1. what we can do immediately in the current metadata data structures
T2. what we can do short term in the current metadata data structures
T3. provide safe and backwards compatible room for experimentation for all metadata (including every thrift struct, even outside of FileMetaData) so that engines can iterate and propose the best format going forward for parquet

T3 is important if we strongly believe that we can get the best design through testing prototypes on real data and measuring the effects, rather than by designing changes in PRs.

Along the same lines, I am requesting that you ask through your contacts/customers (I will do the same) for scrubbed footers of particular interest (wide, deep, etc) so that we can build a set of real footers on which we can run benchmarks and drive design decisions.

I am also putting normative PRs out for T1, T2, T3.

Looking forward to your comments.