Hi Claire,

I agree that supporting all the "legacy" list encodings is painful, and it
has caused trouble in the past.

It seems that parquet-cli mainly depends on parquet-avro, so it also needs
the parquet-avro settings to resolve the list structure. Perhaps we can do
something similar to what parquet-cpp currently does for list encoding
resolution [1], which requires no information beyond the MessageType.
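
To make the idea concrete, here is a rough sketch of schema-only resolution,
written in Python for brevity. It is not the parquet-cpp code and not an
existing parquet-java API: `Field`, `classify_list` and
`detect_list_encoding` are hypothetical names, `Field` is a simplified
stand-in for a schema node, and the rules reflect my reading of the
backward-compatibility section of LogicalTypes.md. MAP-annotated groups are
ignored for simplicity. It also raises on the mixed-encoding corner case you
mentioned.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Field:
    """Hypothetical, simplified stand-in for a Parquet schema node."""
    name: str
    repetition: str                           # "required", "optional" or "repeated"
    children: Optional[List["Field"]] = None  # None means a primitive field
    logical_type: Optional[str] = None        # e.g. "LIST"


def classify_list(f: Field) -> str:
    """Classify one field using only the schema (no reader configuration)."""
    if f.logical_type != "LIST":
        # An unannotated repeated field is a required list of required elements.
        return "one-level" if f.repetition == "repeated" else "not-a-list"
    if not f.children or len(f.children) != 1 or f.children[0].repetition != "repeated":
        raise ValueError(f"malformed LIST group: {f.name}")
    rep = f.children[0]
    if rep.children is None:                        # repeated field is not a group
        return "two-level"
    if not rep.children:
        raise ValueError(f"empty repeated group in LIST: {f.name}")
    if len(rep.children) > 1:                       # group with multiple fields
        return "two-level"
    if rep.children[0].repetition == "repeated":    # single repeated child
        return "two-level"
    if rep.name in ("array", f.name + "_tuple"):    # legacy writer naming
        return "two-level"
    return "three-level"                            # standard layout


def list_encodings(f: Field, found: Optional[Set[str]] = None) -> Set[str]:
    """Collect every list encoding that appears under `f`."""
    found = set() if found is None else found
    kind = classify_list(f)
    if kind != "not-a-list":
        found.add(kind)
    if f.logical_type == "LIST":
        # Skip the repeated wrapper itself so it is not counted as a one-level list.
        nested = f.children[0].children or []
    else:
        nested = f.children or []
    for child in nested:
        list_encodings(child, found)
    return found


def detect_list_encoding(schema: Field) -> Optional[str]:
    """Return the single encoding in use, or None if the schema has no lists."""
    found = list_encodings(schema)
    if len(found) > 1:
        raise ValueError(f"mixed list encodings: {sorted(found)}")
    return next(iter(found), None)
```

A real version would take a MessageType (or read it from the file footer
given a Path) rather than this toy `Field` class.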

[1]
https://github.com/apache/arrow/blob/976d547fba9b4bff4178e515ca8cdcb8a5db4d46/cpp/src/parquet/arrow/schema.cc#L706-L790


Best,
Gang

On Wed, Apr 1, 2026 at 2:08 AM Claire McGinty <[email protected]>
wrote:

> Hi all,
>
> I wanted to bring up the topic of Parquet's supported encodings for List
> logical types
> <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>.
>
> Having multiple valid List encodings is becoming a pain point for my org,
> especially since we read and write Parquet from different engines with
> different default values (for example, Ray/pyarrow
> <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html>
> writes Parquet lists using the latest 3-level list encoding; writes from
> Scio <https://spotify.github.io/scio/io/Parquet.html> use the default
> parquet-avro encoding, which uses an older encoding; we even have a few
> datasets with primitive required list types that just encode using one
> level, e.g. `repeated int32 my_element`).
>
> Parquet-cli
> <https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md>
> also doesn't work out of the box for all these encoding types unless you
> manually supply a Configuration file that specifies the encoding. Overall,
> it's frustrating for our users reading these files to have to look up the
> write schema, then look up the right Configuration key, then figure out how
> to pass that Configuration to parquet-cli or parquet-avro.
>
> So I'm wondering if there'd be any interest in:
>
>    - Contributing a public utility method (to parquet-common? Or maybe
>    there's a better place for it) that accepts either a Parquet
>    `MessageType` or a `Path` and detects which type of List encoding is
>    being used. (This is probably easier said than done, but at least the
>    backwards-compatibility rules
>    <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules>
>    are finite and clear to interpret.)
>    - Integrating that utility method into parquet-cli/parquet-avro, as well
>    as any other Parquet formats that support Lists (e.g. magnolify-parquet
>    <https://spotify.github.io/magnolify/parquet.html>).
>
> One potential corner case I can think of: if you're manually specifying
> your Parquet schema (rather than using an established format like
> parquet-avro), there's nothing preventing you from mixing and matching
> list encodings. But we could just have the utility method throw an
> exception in that case and force the user to specify a schema explicitly.
>
> Thanks, and let me know what you think,
> Claire
>
