Hi all,

We tested whether various Parquet readers depend on path_in_schema
(in ColumnMetaData) to understand the impact of deprecating it. We tested
parquet-mr, Databricks, DuckDB, ClickHouse, Snowflake, and Fabric. The
result is that parquet-mr and Fabric use path_in_schema as a hard
dependency and cannot read files without it. Databricks supports reading
files without path_in_schema in newer versions, and the other readers
support reading them in their latest versions.

That said, deprecating the field is hard in the current Thrift-based
Parquet spec and would require years of effort for the ecosystem to adopt
the change. A reminder that this field is a list of strings and contributes
heavily to footer bloat. We've seen footers as large as 367 MB in
production, with over 60% of the size coming from this field alone.

On the other hand, the FlatBuffer footer proposal gives us the ability to
not only decode the schema in an efficient way but also remove redundant
fields completely without breaking compatibility with the existing Thrift
footer. It doesn't introduce a breaking change, but also doesn't slow down
our ability to evolve the footer. It buys us time to embrace the change
across the entire ecosystem.

Best,
Jiayi

On 2026/03/30 17:59:32 Ed Seidl wrote:
> Thanks for the perspective, Alkis. I'd just like to add a few comments.
>
> On 2026/03/27 13:37:46 Alkis Evlogimenos via dev wrote:
> > 1. Dedup. The Thrift footer repeats path_in_schema (a list of strings)
> > for every column in every row group. For a 10K-column, 4-RG file that's
> > 40K string lists and it's the single biggest source of footer bloat. The
> > FlatBuffer footer drops it entirely; it's derivable from schema + column
> > ordinal. Same for type (already in the schema), the full encodings list,
> > and encoding_stats (replaced by a single bool).
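
To illustrate the "derivable from schema + column ordinal" point above, here
is a minimal sketch, not code from any Parquet implementation. It assumes the
schema is the flattened depth-first list used by the Thrift footer, where
each element carries a name and a child count and the first element is the
root. The SchemaElement class and the leaf_paths function are hypothetical
stand-ins for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchemaElement:
    # Hypothetical stand-in for the Thrift SchemaElement; only the two
    # fields needed for path reconstruction are modeled here.
    name: str
    num_children: Optional[int] = None  # None or 0 means a leaf column


def leaf_paths(schema: List[SchemaElement]) -> List[List[str]]:
    """Return the path of every leaf column, indexed by column ordinal.

    The schema is assumed to be a depth-first flattening whose first
    element is the root; the root name is not part of any path.
    """
    paths: List[List[str]] = []
    pos = 1  # skip the root element

    def walk(prefix: List[str]) -> None:
        nonlocal pos
        elem = schema[pos]
        pos += 1
        path = prefix + [elem.name]
        if not elem.num_children:
            paths.append(path)  # leaf: its ordinal is len(paths) - 1
        else:
            for _ in range(elem.num_children):
                walk(path)

    for _ in range(schema[0].num_children or 0):
        walk([])
    return paths


schema = [
    SchemaElement("root", 2),
    SchemaElement("a"),
    SchemaElement("b", 2),
    SchemaElement("c"),
    SchemaElement("d"),
]
paths = leaf_paths(schema)
print(paths[2])  # path of column ordinal 2: ['b', 'd']
```

The walk runs once per file, so the per-column, per-row-group copies of the
path carry no information beyond what the schema already holds.
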
>
> I agree path_in_schema is pretty useless, but we could just make that
> field optional. Yes this would break old readers, but then so would adding
> a new encoding or compression codec. Old readers can't be expected to work
> forever.
>
> > 2. Compact stats. Thrift Statistics stores min/max as variable-length
> > binary with per-field framing. The FlatBuffer footer uses fixed-width
> > integers for numeric types and a prefix+truncated-suffix scheme for byte
> > arrays. Across thousands of columns this adds up.
>
> I also agree the statistics are a mess. But then, I think a bigger
> problem is overpopulation of the statistics. There is very little benefit
> to simple min/max statistics on unsorted columns. If writers were a little
> more conservative and simply omitted these optional statistics for columns
> that have no chance of benefiting from them that would reduce a great deal
> of bloat.
>
> > 3. Dropped dead weight. ConvertedType, deprecated min/max,
> > distinct_count, SizeStatistics
>
> I'll grant the first two, but already I've seen calls to do something
> with distinct_count, and I personally use the size statistics, so I do not
> agree to the "dead weight" label for those. I do agree that their current
> form is not ideal, but was a compromise at the time. I think one benefit of
> the flatbuffers work would be to separate out metadata needed for
> traversing the file from metadata supporting indexes/other purposes. If we
> can easily add new specialized structures that are easy to ignore I think
> that would be a win.
>
> > A jump table into the existing Thrift footer preserves all of this
> > duplication and bloat. You still have to decode the same fat
> > ColumnMetaData structs, you just get to skip to the right one faster.
>
> Given that most of the ColumnMetaData bloat is at the tail end of the
> struct, the jump table allows for stopping parsing early and skipping to
> the next column. No need to parse the bloat, but it is still there.
>
> > And the index itself
> > adds at least 12 bytes plus framing per column per row group (you need
> > offset+length since Thrift fields are variable-width), so the total
> > footer actually gets bigger.
>
> Not quite. Given that row groups and column chunks are serialized
> back-to-back, one simply needs N+1 offsets, the lengths can then be
> derived. Alternatively, if we use 0 offsets for the start of the row groups
> and the first column chunk in a row group, you could instead just encode N
> lengths and do an exclusive scan to deduce the offsets. This would allow
> for using fewer bytes to encode the lengths at the expense of a little more
> computation when instantiating the table.
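
The two encodings described above can be sketched as follows. This is only
an illustration of the arithmetic, assuming the serialized entries are laid
out back-to-back starting at offset 0; the function names are hypothetical
and not part of any jump-table proposal.

```python
from itertools import accumulate
from typing import List


def lengths_from_offsets(offsets: List[int]) -> List[int]:
    """Given N+1 boundary offsets, derive the N entry lengths."""
    return [end - start for start, end in zip(offsets, offsets[1:])]


def offsets_from_lengths(lengths: List[int]) -> List[int]:
    """Exclusive scan: offsets[i] is the sum of lengths[0..i-1].

    Assumes the first entry starts at offset 0.
    """
    return list(accumulate(lengths, initial=0))[:-1]


lengths = [120, 80, 200]  # illustrative byte lengths of three column chunks
print(offsets_from_lengths(lengths))           # [0, 120, 200]
print(lengths_from_offsets([0, 120, 200, 400]))  # [120, 80, 200]
```

Either form stores one integer per entry (plus one extra for the offsets
form). Lengths are typically smaller numbers than offsets, which is where
the byte savings would come from under a variable-width integer encoding.
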
>
> > Now, if we accept a breaking change is needed to meaningfully shrink the
> > footer, then why not break into a format that also gives us zero-copy
> > access natively?
>
> I do agree that if we are going to completely redo the metadata, then why
> not change to flatbuffers, so long as we're good with the trade offs
> (zero-copy and random access for larger representations).
>
> Cheers,
> Ed
>
>
>
