> I don't understand how it's useful:
> 1) At this point it's too late, the Parquet file was written already, so
> this is not solving the user's problem of "how do I choose a safe
> feature set".

In my mind, the format version is exactly a shared vocabulary for readers
and writers to agree on a safe feature set.

For example, if a writer wants to ensure Spark 4.0 can read their files (I
am making up version numbers), they look up which format version Spark 4.0
supports, find that it supports the features in parquet-format 2.11, and
restrict themselves to just those features.
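
To make the writer-side lookup concrete, here is a rough sketch. Everything
in it is an assumption for illustration: the version numbers, the feature
names, the reader name, and the two lookup tables are all made up, and no
such registry exists today.

```python
# Hypothetical table: which format version introduced which features.
# Version numbers and feature names are invented for illustration.
FEATURES_INTRODUCED_IN = {
    (2, 10): {"feature_x"},
    (2, 11): {"feature_y"},
    (2, 34): {"feature_a", "feature_b"},
}

# Hypothetical table: highest format version each reader advertises.
READER_SUPPORTED_VERSION = {
    "spark-4.0": (2, 11),
}


def safe_features(reader: str) -> set[str]:
    """Return every feature introduced at or below the reader's version."""
    max_version = READER_SUPPORTED_VERSION[reader]
    allowed: set[str] = set()
    for version, features in FEATURES_INTRODUCED_IN.items():
        # Tuples compare element by element, so (2, 11) <= (2, 34).
        if version <= max_version:
            allowed |= features
    return allowed


# A writer targeting the made-up "spark-4.0" entry restricts itself
# to whatever this returns.
print(sorted(safe_features("spark-4.0")))
```

The point is only that a single advertised version number is enough for the
writer to derive a feature set, without a per-feature matrix.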

> 2) Let's say Parquet 2.34 introduces features A and B. Let's also say a
> Parquet reader implements feature A but not feature B. What should this
> reader do if you give it a file that has version 2.34 recorded in the
> metadata? Should it error out (but perhaps the file only uses feature
> A)? Or should it not error out (but perhaps the file uses feature B)?

I would suggest:
1. Basic readers: error out (simplest to code and easiest to explain,
even though some readable files may be rejected), with a user-settable
"ignore version" option to override the check
2. Advanced readers: check the file for features in 2.34 that the reader
doesn't support (e.g. the use of the new ALP encoding) and error if any
are present
In this way, if a reader advertises that it supports version 2.8 of the
spec, then writers can use any feature in that version, and there is no
confusion about read compatibility. I agree this is a coarse system, and it
may mean that some features a reader does support go unused.

For more advanced use cases and readers without complete support, the writer
could do more nuanced research about which extra flags / features to enable.

We can probably come up with other, more precise ways to communicate
individual feature support (feature buckets, feature matrices, etc.), but
they all seem complicated (and require non-trivial consensus on what
constitutes "major features", for example).

Andrew

On Tue, Jun 9, 2026 at 12:46 PM Antoine Pitrou <[email protected]> wrote:

> Le 09/06/2026 à 18:28, Andrew Lamb a écrit :
> >
> >> Aren't we moving the goalposts here?
> >> IIRC the basis for this discussion was to inform Parquet *writers* about
> >> which features can safely be enabled. Recording the format version in a
> >> Parquet file's metadata does not help achieve that.
> >
> > In my mind they are connected -- recording the format in the metadata
> > would allow writers to explicitly communicate to downstream readers
> > which features are required for reading,
>
> I don't understand how it's useful:
>
> 1) At this point it's too late, the Parquet file was written already, so
> this is not solving the user's problem of "how do I choose a safe
> feature set".
>
> 2) Let's say Parquet 2.34 introduces features A and B. Let's also say a
> Parquet reader implements feature A but not feature B. What should this
> reader do if you give it a file that has version 2.34 recorded in the
> metadata? Should it error out (but perhaps the file only uses feature
> A)? Or should it not error out (but perhaps the file uses feature B)?
>
> Regards
>
> Antoine.
>
>
>
