IMO, this is a common feature not tied to Avro, so parquet-column looks like the better place for it.
On Tue, Apr 14, 2026 at 4:03 AM Claire McGinty <[email protected]> wrote:

> Hey Gang, thanks for linking the Arrow code! That functionality would be
> great to have in parquet-java. Would you see it living in the parquet-avro
> reader code specifically (and therefore picked up by parquet-cli), or added
> to the core reader functionality in parquet-column?
>
> - Claire
>
> On Wed, Apr 1, 2026 at 10:22 PM Gang Wu <[email protected]> wrote:
>
> > Hi Claire,
> >
> > I agree that supporting all "legacy" list encodings is painful and it has
> > caused troubles in the past.
> >
> > It seems that parquet-cli mainly depends on parquet-avro so it also
> > requires settings from parquet-avro to resolve list structure. Perhaps we
> > can do something similar to what parquet-cpp currently does for list
> > encoding resolution [1], which does not require extra information other
> > than the MessageType.
> >
> > [1]
> > https://github.com/apache/arrow/blob/976d547fba9b4bff4178e515ca8cdcb8a5db4d46/cpp/src/parquet/arrow/schema.cc#L706-L790
> >
> > Best,
> > Gang
> >
> > On Wed, Apr 1, 2026 at 2:08 AM Claire McGinty <[email protected]>
> > wrote:
> >
> > > Hi all,
> > >
> > > I wanted to bring up the topic of Parquet's supported encodings for List
> > > logical types
> > > <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>.
> > >
> > > Having multiple valid List encodings is becoming a pain point for my org,
> > > especially since we read and write Parquet from different engines with
> > > different default values (for example, Ray/pyarrow
> > > <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html>
> > > writes Parquet lists using the latest 3-level list encoding; writes from
> > > Scio <https://spotify.github.io/scio/io/Parquet.html> use the default
> > > parquet-avro encoding, which uses an older encoding; we even have a few
> > > datasets with primitive required list types that just encode using one
> > > level, e.g. `repeated int32 my_element`).
> > >
> > > Parquet-cli
> > > <https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md>
> > > also doesn't work out of the box for all these encoding types, unless you
> > > manually specify a Configuration file specifying the encoding. Overall,
> > > it's frustrating for our users reading these files to have to look up the
> > > write schema, then look up the right Configuration key, then figure out how
> > > to pass in that Configuration to parquet-cli or parquet-avro.
> > >
> > > So I'm wondering if there'd be any interest in:
> > >
> > > - Contributing a public utility method (to parquet-common? Or maybe
> > >   there's a better place for it) that accepts either a Parquet
> > >   `MessageType` or a `Path` and detects which type of List encoding is
> > >   being used. (This is probably easier said than done, but at least the
> > >   backwards-compatibility rules
> > >   <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules>
> > >   are finite and clear to interpret.)
> > > - Integrating that utility method into parquet-cli/parquet-avro, as well
> > >   as any other parquet formats that support Lists (i.e. magnolify-parquet
> > >   <https://spotify.github.io/magnolify/parquet.html>).
> > >
> > > One potential corner case I can think of is that I guess if you're manually
> > > specifying your Parquet schema (rather than using an established format
> > > like parquet-avro), there's nothing preventing you from mixing and matching
> > > list encodings. But we could just have the utility method throw an
> > > exception in that case and force the user to specify a schema explicitly.
> > >
> > > Thanks, and let me know what you think,
> > > Claire
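
For illustration, the schema-only detection discussed in this thread could look roughly like the sketch below. It is a minimal sketch under stated assumptions, not parquet-java code: `Field`, `Repetition`, `Encoding`, and `classify` are hypothetical stand-ins invented for this example (a real utility would presumably walk parquet-java's `MessageType` instead), and the labels ONE_LEVEL, TWO_LEVEL, and THREE_LEVEL are informal names. It applies a subset of the backward-compatibility rules from LogicalTypes.md: a repeated field that is a primitive, a group with several fields, or a single-field group named `array` or `<list name>_tuple` is treated as the element itself; anything else is read as the standard 3-level structure.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class ListEncodingSketch {
    enum Repetition { REQUIRED, OPTIONAL, REPEATED }
    enum Encoding { ONE_LEVEL, TWO_LEVEL, THREE_LEVEL }

    // Hypothetical stand-in for a schema node; NOT parquet-java's Type/GroupType.
    static final class Field {
        final String name;
        final Repetition repetition;
        final boolean isGroup;
        final boolean listAnnotated;
        final List<Field> children;

        private Field(String name, Repetition repetition, boolean isGroup,
                      boolean listAnnotated, List<Field> children) {
            this.name = name;
            this.repetition = repetition;
            this.isGroup = isGroup;
            this.listAnnotated = listAnnotated;
            this.children = children;
        }

        static Field primitive(String name, Repetition repetition) {
            return new Field(name, repetition, false, false, Collections.<Field>emptyList());
        }

        static Field group(String name, Repetition repetition,
                           boolean listAnnotated, Field... children) {
            return new Field(name, repetition, true, listAnnotated, Arrays.asList(children));
        }
    }

    // Classifies a single list-like field using only the schema, no Configuration.
    static Encoding classify(Field field) {
        if (!field.listAnnotated) {
            // An unannotated repeated field is a list of required elements.
            if (field.repetition == Repetition.REPEATED) {
                return Encoding.ONE_LEVEL;
            }
            throw new IllegalArgumentException(field.name + " is not a list");
        }
        if (field.children.size() != 1
                || field.children.get(0).repetition != Repetition.REPEATED) {
            throw new IllegalArgumentException(field.name + " is not a well-formed LIST group");
        }
        Field repeated = field.children.get(0);
        if (!repeated.isGroup) {
            return Encoding.TWO_LEVEL;   // repeated primitive is the element
        }
        if (repeated.children.size() > 1) {
            return Encoding.TWO_LEVEL;   // multi-field repeated group is the element
        }
        if (repeated.name.equals("array") || repeated.name.equals(field.name + "_tuple")) {
            return Encoding.TWO_LEVEL;   // legacy writer naming conventions
        }
        return Encoding.THREE_LEVEL;     // repeated group wraps the element
    }

    public static void main(String[] args) {
        // repeated int32 my_element
        Field oneLevel = Field.primitive("my_element", Repetition.REPEATED);
        // required group my_list (LIST) { repeated int32 array; }
        Field twoLevel = Field.group("my_list", Repetition.REQUIRED, true,
                Field.primitive("array", Repetition.REPEATED));
        // optional group my_list (LIST) { repeated group list { optional int32 element; } }
        Field threeLevel = Field.group("my_list", Repetition.OPTIONAL, true,
                Field.group("list", Repetition.REPEATED, false,
                        Field.primitive("element", Repetition.OPTIONAL)));
        System.out.println(classify(oneLevel));
        System.out.println(classify(twoLevel));
        System.out.println(classify(threeLevel));
    }
}
```

For the mixed-encoding corner case raised above, a caller could run `classify` on every list field in a schema and throw if the results disagree, which would force the user to supply a schema explicitly.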
