Re: Making path_in_schema optional

Micah Kornfield Thu, 18 Jun 2026 16:31:07 -0700

Thanks Alkis for highlighting this.

> In our testing with a reader that ignores path_in_schema we have found that
> there are writers in the wild that do not follow the spec but
> path_in_schema saves them.



Can you clarify how this saves them?  My impression from the analysis done
for this change is that parquet-java and parquet-cpp don't rely on the
path_in_schema field to do projection. Was this proprietary functionality?

Do you happen to know which writer produced the parquet files?  (If it is an
open source one, maybe we can open a bug.)


> The example is a parquet file with N leaf schema
> elements and K column metadata per row group, where K < N. If one resolves
> with path_in_schema the selected columns are found and work. If one matches
> with schema element order - chaos ensues.


Based on the above, it sounds like this would probably already be a problem
with the current resolution order (i.e. path_in_schema is, at least, not
used in the projection path).
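
To make the K < N scenario concrete, here is a minimal sketch of the two
resolution strategies being compared. All names are hypothetical and not
taken from parquet-java, parquet-cpp, or any other reader; schema leaves and
column chunks are modelled as plain tuples of path components.

```python
from typing import List, Optional, Tuple

Path = Tuple[str, ...]


def resolve_by_path(chunk_paths: List[Path], wanted: Path) -> Optional[int]:
    """Find the column chunk whose path_in_schema equals `wanted`."""
    for i, path in enumerate(chunk_paths):
        if path == wanted:
            return i
    return None  # the writer emitted no metadata for this column


def resolve_by_position(leaves: List[Path], wanted: Path) -> int:
    """Assume chunk i belongs to schema leaf i (index-based resolution)."""
    return leaves.index(wanted)


# N = 3 leaf schema elements, assumed to all be INT32.
leaves = [("a",), ("b",), ("c",)]
# K = 2 column chunks: a non-conforming writer skipped metadata for "b".
chunk_paths = [("a",), ("c",)]

c_by_path = resolve_by_path(chunk_paths, ("c",))  # finds the right chunk
b_by_path = resolve_by_path(chunk_paths, ("b",))  # reports the column missing
b_by_pos = resolve_by_position(leaves, ("b",))    # points at the chunk for "c"
c_by_pos = resolve_by_position(leaves, ("c",))    # index past the last chunk
```

With index-based resolution, asking for "b" lands on the chunk that actually
holds "c"; since the types match in this sketch, nothing would flag the
mismatch.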

> To err to the side of caution we should not do this change lightly. We need
> a version change to drop this field otherwise we risk failed reads and even
> worse data loss. Consider the case of many INT32 columns, where one of them
> is missing in column metadata. If index based resolution lands in the wrong
> column but the type matches it will happily read it even though it is the
> wrong column.


We are discussing a versioning scheme separately.  Specifically for this
change, my understanding is that any parser faithfully parsing thrift
today would fail hard on reads when the field is missing, since thrift
would validate that required fields are present (which this one would not
be).  I'm not sure if we've actually tested this, but are you familiar with
thrift parsers that wouldn't fail when this field is missing?
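
The distinction I have in mind can be sketched as below. This is a toy
model, not real Thrift-generated code: the decoded struct is a plain dict,
and `REQUIRED_FIELDS`, `validate_strict`, and `validate_lenient` are
hypothetical names. Whether any given Thrift implementation behaves like the
lenient variant is exactly the open question.

```python
from typing import Any, Dict

# Fields assumed to be marked "required" for the purposes of this sketch.
REQUIRED_FIELDS = ("type", "path_in_schema")


class MissingRequiredField(Exception):
    """Raised when a required field is absent from a decoded struct."""


def validate_strict(decoded: Dict[str, Any]) -> Dict[str, Any]:
    """Fail hard if any required field is absent."""
    missing = [name for name in REQUIRED_FIELDS if name not in decoded]
    if missing:
        raise MissingRequiredField(", ".join(missing))
    return decoded


def validate_lenient(decoded: Dict[str, Any]) -> Dict[str, Any]:
    """Accept whatever was decoded, leaving absent fields unset."""
    return decoded


# Column metadata written without path_in_schema.
new_style = {"type": 1}

try:
    validate_strict(new_style)
    strict_failed = False
except MissingRequiredField:
    strict_failed = True  # old strict reader: hard failure, no silent misread

lenient_result = validate_lenient(new_style)  # old lenient reader: carries on
```

Only the lenient case leaves room for the silent wrong-column read described
above; a strict reader fails before it gets that far.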

Thanks,
Micah

On Thu, Jun 18, 2026 at 11:46 AM Alkis Evlogimenos via dev <
[email protected]> wrote:

> One data point from the fleet.
>
> In our testing with a reader that ignores path_in_schema we have found that
> there are writers in the wild that do not follow the spec but
> path_in_schema saves them. The example is a parquet file with N leaf schema
> elements and K column metadata per row group, where K < N. If one resolves
> with path_in_schema the selected columns are found and work. If one matches
> with schema element order - chaos ensues.
>
> To err to the side of caution we should not do this change lightly. We need
> a version change to drop this field otherwise we risk failed reads and even
> worse data loss. Consider the case of many INT32 columns, where one of them
> is missing in column metadata. If index based resolution lands in the wrong
> column but the type matches it will happily read it even though it is the
> wrong column.
>
> On Fri, May 29, 2026 at 6:14 AM Ed Seidl <[email protected]> wrote:
>
> > Hi all,
> > Quick update on this. A third PoC implementation in arrow-cpp has been
> > created [1], and a file
> > without the path_in_schema field (created with arrow-rs) has been
> > submitted to parquet-testing [2]. I've confirmed that the java and cpp
> PoCs
> > can properly read the file. I'll be proposing a vote on this proposal
> soon
> > if no objections are raised here or in the PR [3].
> >
> > Cheers,
> > Ed
> >
> > [1] https://github.com/apache/arrow/pull/49707
> > [2] https://github.com/apache/parquet-testing/pull/108
> > [3] https://github.com/apache/parquet-format/pull/564
> >
> > On 2026/04/22 20:58:46 Micah Kornfield wrote:
> > > I need to review the implementations more carefully, but I think this
> > looks
> > > good.  Maybe we should give people through next week for people to
> review
> > > and then we can start a vote?
> > >
> > > On Wed, Apr 22, 2026 at 1:45 PM Steve Loughran <[email protected]>
> > wrote:
> > >
> > > > following on from the discussion today
> > > >
> > > >
> > > >    1. I can see the benefits in tagging it as optional
> > > >    2. it would be a long time before the systems I field support
> calls
> > over
> > > >    would stop generating it because we don't know where data would
> end
> > up
> > > >    being used.
> > > >    3. For those people who are encountering major problems here, it
> > would
> > > >    at least be possible to say "provided you intend to only work with
> > > > versions
> > > > >    of <product> dated 2027 or newer, all is good."
> > > >
> > > > making the field optional as soon as possible would increase the time
> > at
> > > > which parquet releases can actually stop adding the field.
> > > >
> > > > Being able to tie it to a non-backwards-compatible database change
> > (and I'm
> > > > thinking Iceberg v4 tables) would provide a clear way to scope that
> > > > > incompatibility. Imagine if iceberg was set up to turn the feature off
> > when
> > > > generating files for v4 tables, knowing all applications which could
> > read
> > > > the tables wouldn't need path_in_schema. *regardless of the language
> of
> > > > that implementation*
> > > >
> > > > steve
> > > >
> > > > On Mon, 20 Apr 2026 at 09:34, Gang Wu <[email protected]> wrote:
> > > >
> > > > > Thanks Ed for raising this!
> > > > >
> > > > > Overall I'm +1 to this. We need input from others since it is a
> > slight
> > > > > breaking change.
> > > > >
> > > > > Best,
> > > > > Gang
> > > > >
> > > > > On Thu, Apr 9, 2026 at 9:41 PM Ed Seidl <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > Following a lively discussion on this list, I thought I’d take a
> > stab
> > > > at
> > > > > > addressing one pain point in the Parquet footer. I’ve put up a
> > proposal
> > > > > [1]
> > > > > > and PR [2] to switch path_in_schema in the ColumnMetaData from
> > > > “required”
> > > > > > to “optional”. I’ve also whipped up PoCs in Rust [3] and Java
> [4].
> > > > > >
> > > > > > Please take a look and let’s discuss in the PR.
> > > > > >
> > > > > > Thanks,
> > > > > > Ed
> > > > > >
> > > > > > [1] https://github.com/apache/parquet-format/issues/563
> > > > > > [2] https://github.com/apache/parquet-format/pull/564
> > > > > > [3] https://github.com/apache/arrow-rs/pull/9678
> > > > > > [4] https://github.com/apache/parquet-java/pull/3470
> > > > > >
> > > > >
> > > >
> > >
> >
>
