As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread on improvements to the footer metadata.
Based on the conversation so far, there have been a few proposals [3][4][5] to better support files with wide schemas and many row groups. I think there are a lot of interesting ideas in each. It would be good to get further feedback on these to make sure we aren't missing anything, and to define a minimal first iteration for experimental benchmarking to prove out an approach.

I think the next steps would ideally be:

1. Come to a consensus on the overall approach.
2. Prototype and benchmark/test to validate the approaches defined (if we can't come to consensus in item #1, this might help choose a direction).
3. Divide up any final approach into as fine-grained features as possible.
4. Implement across parquet-java, parquet-cpp, parquet-rs (and any other implementations that we can get volunteers for). Additionally, if new APIs are needed to make use of the new structure, it would be good to try to prototype against consumers of Parquet.

Knowing that we have enough people interested in doing #3 is critical to success, so if you have time to devote, it would be helpful to chime in here (I know some people already noted they could help in the original thread).

I think it is likely we will need either an in-person sync or another, more focused design document. I am happy to try to facilitate this (once we have a better sense of who wants to be involved and what time zones they are in, I can schedule a sync if necessary).

Thanks,
Micah

[1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
[2] https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
[3] https://github.com/apache/parquet-format/pull/242
[4] https://github.com/apache/parquet-format/pull/248
[5] https://github.com/apache/parquet-format/pull/250