> I don't understand how it's useful:
> 1) At this point it's too late, the Parquet file was written already, so
> this is not solving the user's problem of "how do I choose a safe
> feature set".

In my mind, the format version is exactly a shared vocabulary for readers and writers to agree on a safe feature set. For example, if a writer wants to ensure Spark 4.0 can read their files (I am making up version numbers), they look up and find that Spark supports the features in parquet-format 2.11, and restrict themselves to just those features.

> 2) Let's say Parquet 2.34 introduces features A and B. Let's also say a
> Parquet reader implements feature A but not feature B. What should this
> reader do if you give it a file that has version 2.34 recorded in the
> metadata? Should it error out (but perhaps the file only uses feature
> A)? Or should it not error out (but perhaps the file uses feature B)?

I would suggest:

1. Basic readers: error out (simplest to code, and easiest to explain the behavior, even though some readable files may be rejected), with a user-defined "ignore version" field
2. Advanced readers: try to check the file for features in 2.34 that they don't support (e.g. the use of the new ALP encoding) and error if any are present

In this way, if a reader advertises that it supports version 2.8 of the spec, then writers can use any of the features in that version, and there is no confusion about read compatibility.

I agree this is a coarse system, and may mean that features implemented by some readers go unused. For more advanced use cases and readers without complete support, the writer could do more nuanced research about what extra flags / features to enable.

We can probably come up with other, more precise ways to communicate individual feature support (feature buckets, feature matrices, etc.), but they all seem complicated (and require non-trivial consensus on what constitutes "major features", for example).

Andrew

On Tue, Jun 9, 2026 at 12:46 PM Antoine Pitrou <[email protected]> wrote:

> Le 09/06/2026 à 18:28, Andrew Lamb a écrit :
> >
> >> Aren't we moving the goalposts here?
> >> IIRC the basis for this discussion was to inform Parquet *writers* about
> >> which features can safely be enabled. Recording the format version in a
> >> Parquet file's metadata does not help achieve that.
> >
> > In my mind they are connected -- recording the format in the metadata would
> > allow writers to explicitly communicate to downstream readers which
> > features are required for reading,
>
> I don't understand how it's useful:
>
> 1) At this point it's too late, the Parquet file was written already, so
> this is not solving the user's problem of "how do I choose a safe
> feature set".
>
> 2) Let's say Parquet 2.34 introduces features A and B. Let's also say a
> Parquet reader implements feature A but not feature B. What should this
> reader do if you give it a file that has version 2.34 recorded in the
> metadata? Should it error out (but perhaps the file only uses feature
> A)? Or should it not error out (but perhaps the file uses feature B)?
>
> Regards
>
> Antoine.
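
P.S. To make the two reader policies suggested in the reply concrete, here is a rough sketch, not taken from any real Parquet implementation. Everything in it is hypothetical: the `Reader` class, the `FEATURES_INTRODUCED` table, the `ignore_version` flag, and the version numbers (made up, as in the thread). Versions are modeled as `(major, minor)` tuples, and the set of features a file uses is passed in directly, where a real reader would have to discover it by inspecting the file's metadata.

```python
from dataclasses import dataclass, field

# Hypothetical table: which features each format version introduced.
# The version numbers and feature names are made up, as in the thread.
FEATURES_INTRODUCED = {
    (2, 34): {"A", "B"},
}


class UnsupportedFileError(Exception):
    """Raised when a reader refuses a file it may not be able to decode."""


@dataclass
class Reader:
    max_version: tuple                 # newest format version fully supported
    extra_features: set = field(default_factory=set)  # newer features implemented anyway
    advanced: bool = False             # inspect features instead of rejecting outright
    ignore_version: bool = False       # user-defined override for basic readers

    def check(self, file_version, features_used):
        """Return None if the file is accepted, raise otherwise."""
        if file_version <= self.max_version or self.ignore_version:
            return None
        if not self.advanced:
            # Basic policy: reject on version alone, even though the
            # file might only use features this reader understands.
            raise UnsupportedFileError(f"file version {file_version} is too new")
        # Advanced policy: collect features introduced after max_version
        # and reject only if the file uses one we do not implement.
        newer = set()
        for version, features in FEATURES_INTRODUCED.items():
            if self.max_version < version <= file_version:
                newer |= features
        missing = (features_used & newer) - self.extra_features
        if missing:
            raise UnsupportedFileError(f"unsupported features: {sorted(missing)}")
        return None
```

With this sketch, a reader that implements A but not B (Antoine's scenario) would be `Reader(max_version=(2, 33), extra_features={"A"}, advanced=True)`: it accepts a 2.34 file that only uses A and rejects one that uses B, while a basic reader at 2.33 rejects both unless `ignore_version` is set.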
