Re: [Discuss] Feature addition requirements/process

Julien Le Dem Tue, 02 Sep 2025 16:55:20 -0700

FYI: I'm hoping to get closer to a conclusion in the meeting tomorrow.
If you could take a second look, I would appreciate it.
Thank you!


On Fri, Aug 29, 2025 at 3:54 PM Julien Le Dem <[email protected]> wrote:

> Thank you for the feedback.
> I have updated the PR with all the feedback and introduced language to
> remove gatekeeping as much as possible and encourage people to feel
> empowered to propose and contribute new things.
>
> https://github.com/apache/parquet-format/pull/513
> See in tree here:
> https://github.com/apache/parquet-format/tree/proposals/proposals
>
>
> On Mon, Aug 11, 2025 at 6:57 AM Andrew Lamb <[email protected]>
> wrote:
>
>> I think the PR[1][2] that Julien created is a pretty nice high level flow
>> as it:
>> 1. Mostly documents clearly what is already done in practice
>> 2. Postpones concerns and consensus about potentially overly restrictive
>> requirements for new features (by not trying to exhaustively specify the
>> criteria)
>> 3. Gives a location to list active proposals
>>
>> We could make progress with his PR without having to come to a consensus
>> on the criteria for inclusion.
>>
>> Once we had that high level flow up, we could try it out and formalize
>> some of the criteria that are used for changes.
>>
>> Andrew
>>
>>
>> [1]: https://github.com/apache/parquet-format/pull/513
>> [2]: https://github.com/apache/parquet-format/tree/proposals/proposals
>>
>> On Mon, Aug 11, 2025 at 12:31 AM Micah Kornfield <[email protected]>
>> wrote:
>>
>> > > In this situation, it's great to say that we want people to run
>> > > benchmarks on some representative datasets and I agree that we
>> > > probably want a substantial performance improvement to justify the
>> > > cost of support. But I think we need to see these things as
>> > > guidelines and not require running 20 benchmarks or not considering
>> > > features with 9% improvements across the board.
>> >
>> > The intention, at least in the doc, was not to require 20-plus datasets
>> > but to collect at least a list of open datasets that we can narrow
>> > down. What I would at least like to see is a fairly standard set of
>> > data to make comparisons consistent. We also discussed this in the
>> > sync. I think it will be up to someone who has bandwidth to help at
>> > least designate a subset of what we want to include.
>> >
>> > Sure, we can maybe soften the language so that a target percentage is
>> > a goal, but there can be trade-offs.
>> >
>> > I actually think having some sort of baseline makes things easier in
>> > some ways, as long as other requirements are met, because it removes
>> > some amount of subjectivity.
>> >
>> > Cheers,
>> > Micah
>> >
>> >
>> >
>> > On Fri, Aug 8, 2025 at 2:40 PM Julien Le Dem <[email protected]> wrote:
>> >
>> > > I agree that the goal is to make contributions easier and not a
>> > > daunting process.
>> > > We could start the process by separating bigger projects that impact
>> > > the format in a non-backward-compatible way (new encodings, new
>> > > footer, etc.) from things that are less impactful (for example,
>> > > adding metadata that can be ignored by older readers).
>> > > The goal of the "proposals" list I'm outlining above is really only
>> > > for bigger projects where we need collaboration across the ecosystem
>> > > (like we just did for Variant).
>> > > I'm taking inspiration from other projects here: Airflow Improvement
>> > > Proposals
>> > > <https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals>
>> > > or Flink Improvement Proposals
>> > > <https://cwiki.apache.org/confluence/display/Flink/Flink+Improvement+Proposals>.
>> > > I think it's also useful to have a central place to find those.
>> > >
>> > > On Fri, Aug 8, 2025 at 12:11 PM Ryan Blue <[email protected]> wrote:
>> > >
>> > > > I like many things about the write up, but I want to call out one
>> > > > potential pitfall.
>> > > >
>> > > > I think that this is needed so that we can evolve the project and so
>> > > > we have a well-understood path for adding new encodings and
>> > > > improvements. If we can't add new things, then the project will
>> > > > become outdated and irrelevant.
>> > > >
>> > > > I'd like to keep that goal in mind when discussing the path that we
>> > > > are documenting because there is a risk of having the opposite
>> > > > effect: by adding so much process or so many requirements to satisfy
>> > > > that people choose not to contribute or can't make it through to the
>> > > > end.
>> > > >
>> > > > You can see this risk at play with many ASF projects that have a
>> > > > well-defined "path to committer". Often these docs start with
>> > > > guidelines that say something like "you'll generally need to
>> > > > contribute for about a year" to give context, but those things turn
>> > > > into rules and the community doesn't add anyone that hasn't been
>> > > > around for a year.
>> > > >
>> > > > In this situation, it's great to say that we want people to run
>> > > > benchmarks on some representative datasets and I agree that we
>> > > > probably want a substantial performance improvement to justify the
>> > > > cost of support. But I think we need to see these things as
>> > > > guidelines and not require running 20 benchmarks or not considering
>> > > > features with 9% improvements across the board.
>> > > >
>> > > > Ryan
>> > > >
>> > > > Ryan
>> > > >
>> > > > On Thu, Aug 7, 2025 at 5:10 PM Julien Le Dem <[email protected]> wrote:
>> > > >
>> > > > > I opened a Draft PR to illustrate what this could look like.
>> > > > > https://github.com/apache/parquet-format/pull/513
>> > > > > See in tree here:
>> > > > > https://github.com/apache/parquet-format/tree/proposals/proposals
>> > > > >
>> > > > > On Wed, Aug 6, 2025 at 3:30 PM Julien Le Dem <[email protected]> wrote:
>> > > > >
>> > > > > > IMO, this doc is pretty close to being ready to be published. We
>> > > > > > can always improve it as we go.
>> > > > > >
>> > > > > > I think that one important part of the whole process is to make
>> > > > > > it easy for everyone to see what proposals are ongoing and their
>> > > > > > status, and to have a clear step to move from
>> > > > > > proposal/evaluation to implementation.
>> > > > > >
>> > > > > > Once we agree the doc is close enough, I would propose to
>> > > > > > publish it in markdown on the parquet-format repo, organized as
>> > > > > > follows:
>> > > > > > - The section "Baseline Requirements for new additions" as its
>> > > > > > own page, documenting how to approach the design of a parquet
>> > > > > > change and the underlying constraints.
>> > > > > > - We add a physical process to list proposals in the
>> > > > > > parquet-format GitHub repo as follows.
>> > > > > > - The steps described in the section "Incorporating
>> > > > > > encoding/compression improvements" become the process on how
>> > > > > > someone creates a proposal and starts a POC.
>> > > > > > - I would complement it with the following steps for people to
>> > > > > > publish their proposals:
>> > > > > >    - We create a folder in the parquet-format repo to hold the
>> > > > > > proposals.
>> > > > > >    - A README in the folder tracks the ongoing POCs and status.
>> > > > > >    - Initiating a proposal starts with a GitHub issue. We create
>> > > > > > a template for it based on what's outlined in that section of
>> > > > > > the doc.
>> > > > > >    - If the discussion concludes that the proposal is worth a
>> > > > > > POC, the author opens a PR to add the proposal in markdown in
>> > > > > > the proposals folder. It links to the GitHub issue where the
>> > > > > > discussion preceding the proposal occurred. More people can
>> > > > > > contribute to the POC as needed.
>> > > > > >    - POC and perf evaluation are implemented as part of the
>> > > > > > proposal.
>> > > > > >    - A vote by the PMC moves the proposal to an actual feature
>> > > > > > in the format (based on the criteria outlined in this doc).
>> > > > > >    - As part of the implementation step, we make sure we have
>> > > > > > cross-compatible implementations as we did for Variant.
>> > > > > > - The section "Measuring improvements" becomes part of that
>> > > > > > process section to explain how we'll decide if the addition is
>> > > > > > worth adding to the spec for the complexity it is adding.
>> > > > > >
>> > > > > > If that makes sense to you all, I can draft a PR to make this
>> > > > > > proposal a little more concrete.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Wed, Aug 6, 2025 at 11:08 AM Andrew Lamb <[email protected]> wrote:
>> > > > > >
>> > > > > >> I would like to bump this thread as it came up again on the
>> > > > > >> parquet sync call today.
>> > > > > >>
>> > > > > >> Specifically, it seems like there is increasing interest in
>> > > > > >> adding new encodings to Parquet, so getting consensus on what
>> > > > > >> that process looks like and what is required is more important.
>> > > > > >>
>> > > > > >> If you are interested in this topic, please leave comments on
>> > > > > >> the Google Doc[1] or reply to this email chain.
>> > > > > >>
>> > > > > >> Thank you,
>> > > > > >> Andrew
>> > > > > >>
>> > > > > >> [1]
>> > > > > >> https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>> > > > > >>
>> > > > > >> On Thu, May 29, 2025 at 2:42 AM Micah Kornfield <[email protected]> wrote:
>> > > > > >>
>> > > > > >> > I wrote up a long overdue draft [1] on how we can move forward
>> > > > > >> > with additional features (it provides some proposed
>> > > > > >> > requirements on both consuming third-party code, as well as
>> > > > > >> > some more specific guidance on new encodings, and some
>> > > > > >> > orthogonal work that would be nice to see).
>> > > > > >> >
>> > > > > >> > The doc still lacks some details, and might be too opinionated
>> > > > > >> > in places, but I think it serves as a good basis for
>> > > > > >> > conversation (and at least gets me out of the critical path
>> > > > > >> > for evolving Parquet).
>> > > > > >> >
>> > > > > >> > I'm very excited to start moving forward with improvements.
>> > > > > >> >
>> > > > > >> > Thanks,
>> > > > > >> > Micah
>> > > > > >> >
>> > > > > >> > [1]
>> > > > > >> > https://docs.google.com/document/d/1qGDnOyoNyPvcN4FCRhbZGAvp0SfewlWo-WVsai5IKUo/edit?tab=t.0
>> > > > > >> >
>> > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
