> The problem I think we're trying to solve is to make it easier and safer
> for users to enable modern features that produce more optimized and more
> efficient Parquet files.

I agree.

> I agree with the "non-trivial consensus" problem, and that's the point
> of calendar-based presets: they eschew the need for "non-trivial
> consensus" as they are based on actual adoption. :-)

To be clear, I am not opposed to presets (or some other scheme to make
adoption clearer).

In fact, as perhaps you are hinting at, the implementation status page[1]
already has a table with yearly adoption ("Minimum Version for Read Support
by Year"). Perhaps that is enough.
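
As a rough illustration of how a writer could turn a table like that into a
preset, here is a minimal sketch. Everything in it is an assumption on my
part: the feature names and years are made up, features_safe_for is a
hypothetical helper, and I have not modeled the actual layout of the table on
the status page.

```python
# Hypothetical data: the first year in which each feature is assumed to be
# readable by every implementation we care about. Names and years are
# invented for illustration only.
READ_SUPPORT_YEAR = {
    "feature_a": 2020,
    "feature_b": 2023,
    "feature_c": 2026,
}


def features_safe_for(preset_year: int) -> set[str]:
    """Return the features whose read support dates from preset_year or earlier."""
    return {name for name, year in READ_SUPPORT_YEAR.items() if year <= preset_year}


# A writer configured with a "2023" preset would enable only these features.
print(sorted(features_safe_for(2023)))  # ['feature_a', 'feature_b']
```

The point is only that the preset is derived from observed adoption, so
nobody has to agree on which features belong to which format version.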

Andrew


[1]: https://parquet.apache.org/docs/file-format/implementationstatus/

On Wed, Jun 10, 2026 at 3:00 AM Antoine Pitrou <[email protected]> wrote:

>
> On 09/06/2026 at 19:19, Andrew Lamb wrote:
> >> I don't understand how it's useful:
> >> 1) At this point it's too late, the Parquet file was written already, so
> >> this is not solving the user's problem of "how do I choose a safe
> >> feature set".
> >
> > In my mind, the format version is exactly a shared vocabulary for readers
> > and writers to agree on a safe feature set.
> >
> > For example, if a writer wants to ensure Spark 4.0 can read their files
> > (I am making up version numbers), they look up and find that Spark supports
> > features in parquet-format 2.11 and restrict themselves to just those
> > features.
>
> What if Spark supports some features from 2.12, but doesn't support all
> the features from 2.11 (or even 2.6), for example?
>
> Historically it's been quite common to have this kind of jagged feature
> adoption where implementations do not necessarily implement features in
> the chronological order of their appearance in parquet-format. Just
> because something is in parquet-format doesn't mean it will get wide
> adoption.
>
> (perhaps some Parquet readers still don't implement modular encryption,
> for example? and let's not talk about INT96 timestamps or LZO
> compression...)
>
> >> 2) Let's say Parquet 2.34 introduces features A and B. Let's also say a
> >> Parquet reader implements feature A but not feature B. What should this
> >> reader do if you give it a file that has version 2.34 recorded in the
> >> metadata? Should it error out (but perhaps the file only uses feature
> >> A)? Or should it not error out (but perhaps the file uses feature B)?
> >
> > I would suggest:
> > 1. Basic readers: error out (simplest to code, and easiest to explain the
> > behavior, even though some readable files may be rejected), with a user
> > defined "ignore version" field
> > 2. Advanced readers: try to check the file for features in 2.34 that it
> > doesn't support (e.g. the use of the new ALP encoding) and error if present
>
> Readers *already* error out when they encounter an unknown encoding in a
> column they are asked to read. What do we gain by having them *also*
> check a version number?
>
> > For more advanced use cases and readers without complete support, the writer
> > could do more nuanced research about what extra flags / features to enable
>
> This is the status quo, and it doesn't work well as users generally
> settle on the conservative defaults exposed by mainstream writers.
>
> The problem I think we're trying to solve is to make it easier and safer
> for users to enable modern features that produce more optimized and more
> efficient Parquet files.
>
> > We can probably come up with other more precise ways to communicate
> > individual feature support (feature buckets, feature matrices, etc) but
> > they all seem complicated (and require non-trivial consensus on what
> > constitutes "major features", for example)
>
> I agree with the "non-trivial consensus" problem, and that's the point
> of calendar-based presets: they eschew the need for "non-trivial
> consensus" as they are based on actual adoption. :-)
>
> Regards
>
> Antoine.
>
>
>
