> > In this situation, it's great to say that we want people to run benchmarks
> > on some representative datasets and I agree that we probably want a
> > substantial performance improvement to justify the cost of support. But I
> > think we need to see these things as guidelines and not require running 20

The intention, at least in the doc, was not to require 20 plus datasets but
to collect at least a list of open datasets that we can narrow down. What I
would at least like to see is a fairly standard set of data to make
comparisons consistent. We also discussed this in the sync. I think it will
be up to someone who has bandwidth to help at least designate a subset of
what we want to include.

> > benchmarks or not considering features with 9% improvements across the
> > board.

Sure, we can maybe soften the language so that the target percentage is a
goal, but there can be trade-offs. I actually think having some sort of
baseline makes things easier in some ways, as long as other requirements are
met, because it removes some amount of subjectivity.

Cheers,
Micah

On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem <jul...@apache.org> wrote:

> I agree that the goal is to make contributions easier and not a daunting
> process.
> We could start the process by separating bigger projects that impact the
> format in a non-backward-compatible way (new encodings, new footer, etc.)
> from things that are less impactful (for example, adding metadata that
> can be ignored by older readers).
> The goal of the "proposals" list I'm outlining above is really only for
> bigger projects where we need collaboration across the ecosystem (like we
> just did for Variant).
> I'm taking inspiration from other projects here: Airflow Improvement
> Proposals
> <https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals>
> or Flink Improvement Proposals
> <https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals>
> I think it's also useful to have a central place to find those.
>
> On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue <rdb...@gmail.com> wrote:
>
> > I like many things about the write up, but I want to call out one
> > potential pitfall.
> >
> > I think that this is needed so that we can evolve the project and so we
> > have a well-understood path for adding new encodings and improvements. If
> > we can't add new things, then the project will become outdated and
> > irrelevant.
> >
> > I'd like to keep that goal in mind when discussing the path that we are
> > documenting because there is a risk of having the opposite effect: by
> > adding so much process or so many requirements to satisfy that people
> > choose not to contribute or can't make it through to the end.
> >
> > You can see this risk at play with many ASF projects that have a
> > well-defined "path to committer". Often these docs start with guidelines
> > that say something like "you'll generally need to contribute for about a
> > year" to give context, but those things turn into rules and the community
> > doesn't add anyone that hasn't been around for a year.
> >
> > In this situation, it's great to say that we want people to run
> > benchmarks on some representative datasets and I agree that we probably
> > want a substantial performance improvement to justify the cost of
> > support. But I think we need to see these things as guidelines and not
> > require running 20 benchmarks or not considering features with 9%
> > improvements across the board.
> >
> > Ryan
> >
> > On Thu, Aug 7, 2025 at 5:10 PM Julien Le Dem <jul...@apache.org> wrote:
> >
> > > I opened a Draft PR to illustrate what this could look like.
> > > https://github.com/apache/parquet-format/pull/513
> > > See in tree here:
> > > https://github.com/apache/parquet-format/tree/proposals/proposals
> > >
> > > On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem <jul...@apache.org> wrote:
> > >
> > > > IMO, this doc is pretty close to being ready to be published. We can
> > > > always improve it as we go.
> > > >
> > > > I think that one important part of the whole process is to make it
> > > > easy for everyone to see what proposals are ongoing and their status
> > > > and a clear step to move from proposal/evaluation to implementation.
> > > >
> > > > Once we agree the doc is close enough, I would propose to publish it
> > > > in markdown on the parquet-format repo, organized as follows:
> > > > - The section "Baseline Requirements for new additions" as its own
> > > >   page, documenting how to approach the design of a parquet change
> > > >   and the underlying constraints.
> > > > - We add a physical process to list proposals in the parquet-format
> > > >   GitHub repo as follows.
> > > > - The steps described in the section "Incorporating
> > > >   encoding/compression improvements" become the process on how
> > > >   someone creates a proposal and starts a POC.
> > > > - I would complement it with the following steps for people to
> > > >   publish their proposals:
> > > >   - We create a folder in the parquet-format repo to hold the
> > > >     proposals.
> > > >   - A README in the folder tracks the ongoing POCs and status.
> > > >   - Initiating a proposal starts with a GitHub issue. We create a
> > > >     template for it based on what's outlined in that section of the
> > > >     doc.
> > > >   - If the discussion concludes that the proposal is worth a POC,
> > > >     the author opens a PR to add the proposal in markdown in the
> > > >     proposals folder. It links to the GitHub issue where the
> > > >     discussion preceding the proposal occurred. More people can
> > > >     contribute to the POC as needed.
> > > >   - POC and perf evaluation are implemented as part of the proposal.
> > > >   - A vote by the PMC moves the proposal to an actual feature in the
> > > >     format (based on the criteria outlined in this doc).
> > > >   - As part of the implementation step, we make sure we have
> > > >     cross-compatible implementations as we did for Variant.
> > > > - The section "Measuring improvements" becomes part of that process
> > > >   section to explain how we'll decide if the addition is worth adding
> > > >   to the spec for the complexity it is adding.
> > > >
> > > > If that makes sense to you all, I can draft a PR to make this
> > > > proposal a little more concrete.
> > > >
> > > > On Wed, Aug 6, 2025 at 11:08 AM Andrew Lamb <andrewlam...@gmail.com>
> > > > wrote:
> > > >
> > > > > I would like to bump this thread as it came up again on the parquet
> > > > > sync call today.
> > > > >
> > > > > Specifically, it seems like there is increasing interest in adding
> > > > > new encodings to Parquet, so getting consensus on what that process
> > > > > looks like and what is required is more important.
> > > > >
> > > > > If you are interested in this topic, please leave comments on the
> > > > > Google Doc [1] or reply to this email chain.
> > > > >
> > > > > Thank you,
> > > > > Andrew
> > > > >
> > > > > [1]
> > > > > https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
> > > > >
> > > > > On Thu, May 29, 2025 at 2:42 AM Micah Kornfield
> > > > > <emkornfi...@gmail.com> wrote:
> > > > >
> > > > > > I wrote up a long overdue draft [1] on how we can move forward
> > > > > > with additional features (it provides some proposed requirements
> > > > > > on consuming third-party code, as well as some more specific
> > > > > > guidance on new encodings, and some orthogonal work that would be
> > > > > > nice to see).
> > > > > >
> > > > > > The doc still lacks some details, and might be too opinionated in
> > > > > > places, but I think it serves as a good basis for conversation
> > > > > > (and at least gets me out of the critical path for evolving
> > > > > > Parquet).
> > > > > >
> > > > > > I'm very excited to start moving forward with improvements.
> > > > > >
> > > > > > Thanks,
> > > > > > Micah
> > > > > >
> > > > > > [1]
> > > > > > https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0