Re: Auto-detecting list encodings in Parquet files

Gang Wu Tue, 14 Apr 2026 02:49:13 -0700

IMO, this is a common feature that isn't tied to Avro, so parquet-column
looks like the better home for it.


On Tue, Apr 14, 2026 at 4:03 AM Claire McGinty <[email protected]>
wrote:

> Hey Gang, thanks for linking the Arrow code! That functionality would be
> great to have in parquet-java. Would you see it living in the parquet-avro
> reader code specifically (and therefore picked up by parquet-cli), or added
> to the core reader functionality in parquet-column?
>
> - Claire
>
> On Wed, Apr 1, 2026 at 10:22 PM Gang Wu <[email protected]> wrote:
>
> > Hi Claire,
> >
> > I agree that supporting all "legacy" list encodings is painful, and it has
> > caused trouble in the past.
> >
> > It seems that parquet-cli mainly depends on parquet-avro, so it also needs
> > parquet-avro settings to resolve the list structure. Perhaps we can do
> > something similar to what parquet-cpp currently does for list encoding
> > resolution [1], which requires no information beyond the MessageType.
> >
> > [1]
> > https://github.com/apache/arrow/blob/976d547fba9b4bff4178e515ca8cdcb8a5db4d46/cpp/src/parquet/arrow/schema.cc#L706-L790
> >
> > Best,
> > Gang
> >
> > On Wed, Apr 1, 2026 at 2:08 AM Claire McGinty <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I wanted to bring up the topic of Parquet's supported encodings for List
> > > logical types
> > > <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists>.
> > >
> > > Having multiple valid List encodings is becoming a pain point for my org,
> > > especially since we read and write Parquet from different engines with
> > > different defaults. For example, Ray/pyarrow
> > > <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html>
> > > writes Parquet lists using the latest 3-level list encoding; writes from
> > > Scio <https://spotify.github.io/scio/io/Parquet.html> use the default
> > > parquet-avro encoding, which is an older encoding; and we even have a few
> > > datasets with primitive required list types that are encoded using just
> > > one level, e.g. `repeated int32 my_element`.
> > >
> > > Parquet-cli
> > > <https://github.com/apache/parquet-java/blob/master/parquet-cli/README.md>
> > > also doesn't work out of the box for all of these encodings unless you
> > > manually supply a Configuration file that specifies the encoding. Overall,
> > > it's frustrating for our users reading these files to have to look up the
> > > write schema, then look up the right Configuration key, then figure out how
> > > to pass that Configuration in to parquet-cli or parquet-avro.
> > >
> > > So I'm wondering if there'd be any interest in:
> > >
> > >    - Contributing a public utility method (to parquet-common? Or maybe
> > >    there's a better place for it) that accepts either a Parquet `MessageType`
> > >    or a `Path` and detects which type of List encoding is being used. (This
> > >    is probably easier said than done, but at least the
> > >    backwards-compatibility rules
> > >    <https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules>
> > >    are finite and clear to interpret.)
> > >    - Integrating that utility method into parquet-cli/parquet-avro, as well
> > >    as any other Parquet formats that support Lists (e.g. magnolify-parquet
> > >    <https://spotify.github.io/magnolify/parquet.html>).
> > >
> > > One potential corner case I can think of: if you're manually specifying
> > > your Parquet schema (rather than using an established format like
> > > parquet-avro), there's nothing preventing you from mixing and matching
> > > list encodings. But we could just have the utility method throw an
> > > exception in that case and force the user to specify a schema explicitly.
> > >
> > > Thanks, and let me know what you think,
> > > Claire
> > >
> >
>
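
As an illustration of the detection utility proposed in the thread, here is a
minimal Python sketch of how the backward-compatibility rules for
LIST-annotated groups might be applied to a schema alone. The `Field` class,
the `classify_list` and `detect_encoding` functions, and the "one_level" /
"two_level" / "three_level" labels are all hypothetical names invented for this
sketch. They are not part of parquet-java, parquet-cpp, or any other library,
and the rule ordering reflects one reading of the parquet-format LogicalTypes
document rather than a reference implementation.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a Parquet schema node (not a real library type)."""
    name: str
    repetition: str                                  # "required", "optional" or "repeated"
    children: Optional[Tuple["Field", ...]] = None   # None means a primitive field
    annotation: Optional[str] = None                 # e.g. "LIST"


def classify_list(field: Field) -> str:
    """Classify one list-like field using only its schema structure."""
    if field.annotation != "LIST":
        # A bare repeated field with no LIST wrapper, e.g. `repeated int32 my_element`.
        if field.repetition == "repeated":
            return "one_level"
        raise ValueError(f"{field.name!r} is not a list")

    if field.children is None or len(field.children) != 1:
        raise ValueError(f"LIST group {field.name!r} must have exactly one child")
    repeated = field.children[0]
    if repeated.repetition != "repeated":
        raise ValueError(f"child of LIST group {field.name!r} must be repeated")

    # Repeated field is a primitive: it is the element itself.
    if repeated.children is None:
        return "two_level"
    if len(repeated.children) == 0:
        raise ValueError(f"repeated group {repeated.name!r} has no fields")
    # Repeated group with several fields: the group is the element.
    if len(repeated.children) > 1:
        return "two_level"
    # Repeated group whose single field is itself repeated: the group is the element.
    if repeated.children[0].repetition == "repeated":
        return "two_level"
    # Legacy writer naming conventions: `array` or `<list name>_tuple`.
    if repeated.name in ("array", field.name + "_tuple"):
        return "two_level"
    # Otherwise this is the standard 3-level structure.
    return "three_level"


def detect_encoding(fields: Iterable[Field]) -> Optional[str]:
    """Return the single encoding used by the given fields, or raise if they are mixed."""
    found = {
        classify_list(f)
        for f in fields
        if f.annotation == "LIST" or f.repetition == "repeated"
    }
    if len(found) > 1:
        raise ValueError(f"mixed list encodings: {sorted(found)}")
    return found.pop() if found else None
```

The sketch only inspects the fields it is handed and does not recurse into
nested groups; a real utility would need to walk the whole schema. Raising on
mixed encodings mirrors the corner case Claire describes for hand-written
schemas.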
