FYI: I'm hoping to get closer to a conclusion in the meeting tomorrow. If you could take a second look, I would appreciate it. Thank you!
On Fri, Aug 29, 2025 at 3:54 PM Julien Le Dem <jul...@apache.org> wrote:

> Thank you for the feedback.
> I have updated the PR with all the feedback and introduced language to remove gatekeeping as much as possible and encourage people to feel empowered to propose and contribute new things.
>
> https://github.com/apache/parquet-format/pull/513
> See in tree here:
> https://github.com/apache/parquet-format/tree/proposals/proposals
>
> On Mon, Aug 11, 2025 at 6:57 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>
>> I think the PR[1][2] that Julien created is a pretty nice high level flow as it:
>> 1. Mostly documents clearly what is already done in practice
>> 2. Postpones concerns and consensus about potentially overly restrictive requirements for new features (but not trying to exhaustively specify the criteria)
>> 3. Gives a location to list active proposals
>>
>> We could make progress with his PR without having to come to a consensus on the criteria for inclusion.
>>
>> Once we had that high level flow up, we could try it out and formalize some of the criteria that are used for changes.
>>
>> Andrew
>>
>> [1]: https://github.com/apache/parquet-format/pull/513
>> [2]: https://github.com/apache/parquet-format/tree/proposals/proposals
>>
>> On Mon, Aug 11, 2025 at 12:31 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>
>> > > In this situation, it's great to say that we want people to run benchmarks on some representative datasets and I agree that we probably want a substantial performance improvement to justify the cost of support. But I think we need to see these things as guidelines and not require running 20
>> >
>> > The intention at least in the doc was to require 20 plus datasets but to collect at least a list of open datasets that we can narrow down. What I would at least like to see is a fairly standard set of data to make comparisons consistent.
>> > We also discussed this in the sync. I think it will be up to someone who has bandwidth to help at least designate a subset of what we want to include.
>> >
>> > > benchmarks or not considering features with 9% improvements across the board.
>> >
>> > Sure, we can maybe make the language softer on having a target percentage be a target goal, but there can be trade-offs.
>> >
>> > I actually think having some sort of baseline makes things easier in some ways, as long as other requirements are met, because it removes some amount of subjectivity.
>> >
>> > Cheers,
>> > Micah
>> >
>> > On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem <jul...@apache.org> wrote:
>> >
>> > > I agree that the goal is to make contributions easier and not a daunting process.
>> > > We could start the process by separating bigger projects that are impacting the format in a non-backward-compatible way (new encodings, new footer, etc.), versus things that are not as impacting (for example adding metadata that can be ignored by older readers).
>> > > The goal of the "proposals" list I'm outlining above is really only for bigger projects where we need collaboration across the ecosystem (like we just did for Variant).
>> > > I'm taking inspiration from other projects here: Airflow Improvement Proposals <https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals> or Flink Improvement Proposals <https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals>
>> > > I think it's also useful to have a central place to find those.
>> > >
>> > > On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue <rdb...@gmail.com> wrote:
>> > >
>> > > > I like many things about the write up, but I want to call out one potential pitfall.
>> > > >
>> > > > I think that this is needed so that we can evolve the project and so we have a well-understood path for adding new encodings and improvements. If we can't add new things, then the project will become outdated and irrelevant.
>> > > >
>> > > > I'd like to keep that goal in mind when discussing the path that we are documenting because there is a risk of having the opposite effect: by adding so much process or so many requirements to satisfy that people choose not to contribute or can't make it through to the end.
>> > > >
>> > > > You can see this risk at play with many ASF projects that have a well-defined "path to committer". Often these docs start with guidelines that say something like "you'll generally need to contribute for about a year" to give context, but those things turn into rules and the community doesn't add anyone that hasn't been around for a year.
>> > > >
>> > > > In this situation, it's great to say that we want people to run benchmarks on some representative datasets and I agree that we probably want a substantial performance improvement to justify the cost of support. But I think we need to see these things as guidelines and not require running 20 benchmarks or not considering features with 9% improvements across the board.
>> > > >
>> > > > Ryan
>> > > >
>> > > > On Thu, Aug 7, 2025 at 5:10 PM Julien Le Dem <jul...@apache.org> wrote:
>> > > >
>> > > > > I opened a Draft PR to illustrate what this could look like.
>> > > > > https://github.com/apache/parquet-format/pull/513
>> > > > > See in tree here:
>> > > > > https://github.com/apache/parquet-format/tree/proposals/proposals
>> > > > >
>> > > > > On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem <jul...@apache.org> wrote:
>> > > > >
>> > > > > > IMO, this doc is pretty close to being ready to be published. We can always improve it as we go.
>> > > > > >
>> > > > > > I think that one important part of the whole process is to make it easy for everyone to see what proposals are ongoing and their status and a clear step to move from proposal/evaluation to implementation.
>> > > > > >
>> > > > > > Once we agree the doc is close enough, I would propose to publish it in markdown on the parquet-format repo, organized as follows:
>> > > > > > - The section "Baseline Requirements for new additions" as its own page, documenting how to approach the design of a parquet change and the underlying constraints.
>> > > > > > - We add a physical process to list proposals in the parquet-format github Repo as follows.
>> > > > > > - The steps described in the section "Incorporating encoding/compression improvements" become the process on how someone creates a proposal and starts a POC.
>> > > > > > - I would complement it by the following steps for people to publish their proposals:
>> > > > > >    - We create a folder in the parquet-format repo to hold the proposals.
>> > > > > >    - a Readme in the folder tracks the ongoing POCs and status.
>> > > > > >    - Initiating a proposal starts with a github issue. We create a template for it based on what's outlined in that section of the doc.
>> > > > > >    - If the discussion concludes that the proposal is worth a POC, the author opens a PR to add the proposal in markdown in the proposals folder. It links to the Github issue where the discussion preceding the proposal occurred. More people can contribute to the POC as needed.
>> > > > > >    - POC and perf evaluation are implemented as part of the proposal.
>> > > > > >    - a vote by the PMC moves the proposal to actual feature in the format (based on the criteria outlined in this doc).
>> > > > > >    - As part of the implementation step, we make sure we have cross compatible implementations as we did for Variant.
>> > > > > > - The section "Measuring improvements" becomes part of that process section to explain how we'll decide if the addition is worth adding to the spec for the complexity it is adding.
>> > > > > >
>> > > > > > If that makes sense to you all, I can draft a PR to make this proposal a little more concrete.
>> > > > > >
>> > > > > > On Wed, Aug 6, 2025 at 11:08 AM Andrew Lamb <andrewlam...@gmail.com> wrote:
>> > > > > >
>> > > > > >> I would like to bump this thread as it came up again on the parquet sync call today
>> > > > > >>
>> > > > > >> Specifically, it seems like there is increasing interest in adding new encodings to Parquet, so getting consensus on what that process looks like and what is required is more important.
>> > > > > >>
>> > > > > >> If you are interested in this topic, please leave comments on the Google Doc[1] or reply to this email chain.
>> > > > > >>
>> > > > > >> Thank you,
>> > > > > >> Andrew
>> > > > > >>
>> > > > > >> [1] https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>> > > > > >>
>> > > > > >> On Thu, May 29, 2025 at 2:42 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>> > > > > >>
>> > > > > >> > I wrote up a long overdue draft <https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0> [1] on how we can move forward with additional features (it provides some proposed requirements on both consuming third-party code, as well as some more specific guidance on new encodings, and some orthogonal work that would be nice to see).
>> > > > > >> >
>> > > > > >> > The doc still lacks some details, and might be too opinionated in places but I think it serves as a good basis for conversation (and at least gets me out of the critical path for evolving Parquet).
>> > > > > >> >
>> > > > > >> > I'm very excited to start moving forward with improvements.
>> > > > > >> >
>> > > > > >> > Thanks,
>> > > > > >> > Micah
>> > > > > >> >
>> > > > > >> > [1] https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0