Hi Claire,

I agree that supporting all the "legacy" list encodings is painful, and it has caused trouble in the past.
It seems that parquet-cli mainly depends on parquet-avro, so it also requires settings from parquet-avro to resolve list structure. Perhaps we can do something similar to what parquet-cpp currently does for list encoding resolution [1], which does not require extra information other than the MessageType.

[1] https://github.com/apache/arrow/blob/976d547fba9b4bff4178e515ca8cdcb8a5db4d46/cpp/src/parquet/arrow/schema.cc#L706-L790

Best,
Gang

On Wed, Apr 1, 2026 at 2:08 AM Claire McGinty <[email protected]> wrote:

> Hi all,
>
> I wanted to bring up the topic of Parquet's supported encodings for List
> logical types
> <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>.
>
> Having multiple valid List encodings is becoming a pain point for my org,
> especially since we read and write Parquet from different engines with
> different default values (for example, Ray/pyarrow
> <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html>
> writes Parquet lists using the latest 3-level list encoding; writes from
> Scio <https://spotify.github.io/scio/io/Parquet.html> use the default
> parquet-avro encoding, which uses an older encoding; we even have a few
> datasets with primitive required list types that just encode using one
> level, e.g. `repeated int32 my_element`).
>
> Parquet-cli
> <https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md>
> also doesn't work out of the box for all these encoding types, unless you
> manually specify a Configuration file specifying the encoding. Overall,
> it's frustrating for our users reading these files to have to look up the
> write schema, then look up the right Configuration key, then figure out how
> to pass in that Configuration to parquet-cli or parquet-avro.
>
> So I'm wondering if there'd be any interest in:
>
>    - Contributing a public utility method (to parquet-common? Or maybe
>    there's a better place for it) that accepts either a Parquet
>    `MessageType` or a `Path` and detects which type of List encoding is
>    being used. (This is probably easier said than done, but at least the
>    backwards-compatibility rules
>    <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules>
>    are finite and clear to interpret.)
>    - Integrating that utility method into parquet-cli/parquet-avro, as well
>    as any other parquet formats that support Lists (i.e. magnolify-parquet
>    <https://spotify.github.io/magnolify/parquet.html>).
>
> One potential corner case I can think of is that I guess if you're manually
> specifying your Parquet schema (rather than using an established format
> like parquet-avro), there's nothing preventing you from mixing and matching
> list encodings. But we could just have the utility method throw an
> exception in that case and force the user to specify a schema explicitly.
>
> Thanks, and let me know what you think,
> Claire
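
To make the idea of resolving the list structure from the schema alone a bit more concrete, here is a rough Python sketch. It is not the parquet-cpp logic and not an existing parquet-java API: `Node`, `classify_list`, and `detect_encoding` are hypothetical names, the schema is modelled with a toy dataclass rather than a real `MessageType`, and the rules are a paraphrase of the backward-compatibility rules in LogicalTypes.md that should be checked against the spec. The labels "one-level", "two-level", and "three-level" are informal, and only top-level fields are inspected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Toy stand-in for a Parquet schema field (hypothetical, not a real API)."""
    name: str
    repetition: str                          # "required", "optional" or "repeated"
    children: Optional[List["Node"]] = None  # None means a primitive field
    is_list: bool = False                    # True if annotated with LIST

    @property
    def is_group(self) -> bool:
        return self.children is not None


def classify_list(node: Node) -> str:
    """Classify one list-like field using only its schema shape."""
    if not node.is_list:
        # Unannotated repeated field, e.g. `repeated int32 my_element`.
        if node.repetition == "repeated":
            return "one-level"
        raise ValueError(f"{node.name} is not a list")

    # A LIST-annotated group must wrap exactly one repeated field.
    if not node.is_group or len(node.children) != 1:
        raise ValueError(f"{node.name}: LIST group must have exactly one field")
    repeated = node.children[0]
    if repeated.repetition != "repeated":
        raise ValueError(f"{node.name}: inner field must be repeated")

    # Rule 1: repeated field is not a group, so it is the element itself.
    if not repeated.is_group:
        return "two-level"
    if not repeated.children:
        raise ValueError(f"{node.name}: repeated group has no fields")
    # Rule 2: repeated group with multiple fields is the element itself.
    if len(repeated.children) > 1:
        return "two-level"
    # Rule 3: legacy naming conventions (`array` or `<list name>_tuple`).
    if repeated.name in ("array", node.name + "_tuple"):
        return "two-level"
    # Rule 4: otherwise the single child of the repeated group is the element.
    return "three-level"


def detect_encoding(fields: List[Node]) -> Optional[str]:
    """Return the single encoding used by top-level lists, or None if no lists.

    Raises ValueError when encodings are mixed, as suggested in the thread.
    """
    found = {
        classify_list(f)
        for f in fields
        if f.is_list or f.repetition == "repeated"
    }
    if len(found) > 1:
        raise ValueError(f"mixed list encodings: {sorted(found)}")
    return found.pop() if found else None


# Example schemas for the three cases mentioned in the thread.
three_level = Node(
    "my_list", "optional",
    [Node("list", "repeated", [Node("element", "optional")])],
    is_list=True,
)
two_level = Node("my_list", "required", [Node("array", "repeated")], is_list=True)
one_level = Node("my_element", "repeated")
```

With a helper along these lines, a tool could call `detect_encoding` on the file schema and choose the matching read settings itself, falling back to an explicit user-supplied schema only when the call raises on mixed encodings.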
