I agree that the goal is to make contributing easier, not to turn it into a daunting process. We could start by separating bigger projects that change the format in a non-backward-compatible way (new encodings, new footer, etc.) from changes that have less impact (for example, adding metadata that older readers can ignore). The "proposals" list I'm outlining above is really only for bigger projects where we need collaboration across the ecosystem (like we just did for Variant).

I'm taking inspiration from other projects here: Airflow Improvement Proposals <https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals> or Flink Improvement Proposals <https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals>. I think it's also useful to have a central place to find those.

On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue <rdb...@gmail.com> wrote:

> I like many things about the write up, but I want to call out one potential pitfall.
>
> I think that this is needed so that we can evolve the project and so we have a well-understood path for adding new encodings and improvements. If we can't add new things, then the project will become outdated and irrelevant.
>
> I'd like to keep that goal in mind when discussing the path that we are documenting because there is a risk of having the opposite effect: by adding so much process or so many requirements to satisfy that people choose not to contribute or can't make it through to the end.
>
> You can see this risk at play with many ASF projects that have a well-defined "path to committer". Often these docs start with guidelines that say something like "you'll generally need to contribute for about a year" to give context, but those things turn into rules and the community doesn't add anyone that hasn't been around for a year.
>
> In this situation, it's great to say that we want people to run benchmarks on some representative datasets and I agree that we probably want a substantial performance improvement to justify the cost of support. But I think we need to see these things as guidelines and not require running 20 benchmarks or not considering features with 9% improvements across the board.
>
> Ryan
>
> On Thu, Aug 7, 2025 at 5:10 PM Julien Le Dem <jul...@apache.org> wrote:
>
> > I opened a Draft PR to illustrate what this could look like.
> > https://github.com/apache/parquet-format/pull/513
> > See in tree here:
> > https://github.com/apache/parquet-format/tree/proposals/proposals
> >
> > On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem <jul...@apache.org> wrote:
> >
> > > IMO, this doc is pretty close to being ready to be published. We can always improve it as we go.
> > >
> > > I think that one important part of the whole process is to make it easy for everyone to see what proposals are ongoing and their status and a clear step to move from proposal/evaluation to implementation.
> > >
> > > Once we agree the doc is close enough, I would propose to publish it in markdown on the parquet-format repo, organized as follows:
> > > - The section "Baseline Requirements for new additions" as its own page, documenting how to approach the design of a parquet change and the underlying constraints.
> > > - We add a physical process to list proposals in the parquet-format github Repo as follows.
> > > - The steps described in the section "Incorporating encoding/compression improvements" become the process on how someone creates a proposal and starts a POC.
> > > - I would complement it by the following steps for people to publish their proposals:
> > >   - We create a folder in the parquet-format repo to hold the proposals.
> > >   - a Readme in the folder tracks the ongoing POCs and status.
> > >   - Initiating a proposal starts with a github issue. We create a template for it based on what's outlined in that section of the doc.
> > >   - If the discussion concludes that the proposal is worth a POC, the author opens a PR to add the proposal in markdown in the proposals folder. It links to the Github issue where the discussion preceding the proposal occurred. More people can contribute to the POC as needed.
> > >   - POC and perf evaluation are implemented as part of the proposal.
> > >   - a vote by the PMC moves the proposal to actual feature in the format (based on the criteria outlined in this doc).
> > >   - As part of the implementation step, we make sure we have cross compatible implementations as we did for Variant.
> > > - The section "Measuring improvements" becomes part of that process section to explain how we'll decide if the addition is worth adding to the spec for the complexity it is adding.
> > >
> > > If that makes sense to you all, I can draft a PR to make this proposal a little more concrete.
> > >
> > > On Wed, Aug 6, 2025 at 11:08 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
> > >
> > >> I would like to bump this thread as it came up again on the parquet sync call today
> > >>
> > >> Specifically, it seems like there is increasing interest in adding new encodings to the Parquet, so getting consensus on what that process looks like and what is required is more important.
> > >>
> > >> If you are interested in this topic, please leave comments on the Google Doc[1] or reply to this email chain.
> > >>
> > >> Thank you,
> > >> Andrew
> > >>
> > >> [1] https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
> > >>
> > >> On Thu, May 29, 2025 at 2:42 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >>
> > >> > I wrote up a long overdue draft <https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0> [1] on how we can move forward with additional features (it provides some proposed requirements on both consuming third-party code, as well as some more specific guidance on new encodings, and some orthogonal work that would be nice to see).
> > >> >
> > >> > The doc still lacks some details, and might be too opinionated in places but I think it serves as a good basis for conversation (and at least gets me out of the critical path for evolving Parquet).
> > >> >
> > >> > I'm very excited to start moving forward with improvements.
> > >> >
> > >> > Thanks,
> > >> > Micah
> > >> >
> > >> > [1] https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0