As a follow-up to the "V3" discussions [1][2], I wanted to start a thread on
improvements to the footer metadata.

Based on conversation so far, there have been a few proposals [3][4][5] to
help better support files with wide schemas and many row-groups.  I think
there are a lot of interesting ideas in each. It would be good to get
further feedback on these to make sure we aren't missing anything and
define a minimal first iteration for doing experimental benchmarking to
prove out an approach.

I think the next steps would ideally be:
1.  Come to a consensus on the overall approach.
2.  Build prototypes to benchmark and test the proposed approaches (if we
can't come to consensus in item #1, this might help choose a direction).
3.  Divide up any final approach into as fine-grained features as possible.
4.  Implement across parquet-java, parquet-cpp, parquet-rs (and any other
implementations that we can get volunteers for).  Additionally, if new APIs
are needed to make use of the new structure, it would be good to try to
prototype against consumers of Parquet.

Knowing that we have enough people interested in doing #3 is critical to
success, so if you have time to devote, it would be helpful to chime in
here (I know some people already noted they could help in the original
thread).

I think it is likely we will need either an in-person sync or another, more
focused design document. I am happy to try to facilitate this
(once we have a better sense of who wants to be involved and what time
zones they are in I can schedule a sync if necessary).

Thanks,
Micah

[1] https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
[2] https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
[3] https://github.com/apache/parquet-format/pull/242
[4] https://github.com/apache/parquet-format/pull/248
[5] https://github.com/apache/parquet-format/pull/250
