Re: [DISCUSS] Alternative to FlatBuffer Footer: A Lightweight Byte-Offset Index

Andrew Lamb Thu, 09 Apr 2026 03:36:52 -0700

I don't think the work on ALP is blocked on some new format; I think it is
just waiting for some example files in parquet-testing and some final votes.


It seems like the very least we could do is make a PR to parquet-java to
stop needing `path_in_schema`.

And then we could add some info to the spec explaining that older readers
might need it, but newer writers might choose not to populate it if they
wanted to reduce their metadata size (at the expense of older readers not
being able to read the files).
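
For illustration, here is a minimal sketch of why the field is redundant: the
footer already stores the schema as a flattened, depth-first list of elements,
so each leaf column's path can be rebuilt from it. The `SchemaElement` class
and `leaf_paths` function below are hypothetical stand-ins written for this
example, not code from parquet-java or any other implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchemaElement:
    """Simplified stand-in for the Thrift SchemaElement (name + child count only)."""
    name: str
    num_children: Optional[int] = None  # None or 0 is treated as a leaf column


def leaf_paths(schema: List[SchemaElement]) -> List[List[str]]:
    """Return the path components of every leaf column, in depth-first order.

    schema[0] is assumed to be the root; its name is not part of any path.
    """
    paths: List[List[str]] = []

    def walk(index: int, prefix: List[str]) -> int:
        """Visit the element at `index`; return the index of the next unvisited one."""
        element = schema[index]
        path = prefix + [element.name]
        index += 1
        if not element.num_children:
            paths.append(path)
            return index
        for _ in range(element.num_children):
            index = walk(index, path)
        return index

    # Start below the root so the root name is excluded from the paths.
    index = 1
    for _ in range(schema[0].num_children or 0):
        index = walk(index, [])
    return paths


schema = [
    SchemaElement("schema", 2),
    SchemaElement("id"),
    SchemaElement("address", 2),
    SchemaElement("city"),
    SchemaElement("zip"),
]
print(leaf_paths(schema))  # [['id'], ['address', 'city'], ['address', 'zip']]
```

Under this assumption, the i-th path returned would correspond to the i-th
column chunk of a row group, which is the information `path_in_schema`
repeats for every chunk.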

Andrew

On Wed, Apr 8, 2026 at 12:17 PM Ed Seidl <[email protected]> wrote:

> Hi Jiayi,
>
> This is getting a little off topic, but I did a quick test to see how much
> would be involved in getting some major implementations to support an
> optional path_in_schema.
>
> In short, parquet-java required changing a single constructor call to use a
> setter, and replacing two accesses of path_in_schema with an already
> available array of paths from the schema metadata. arrow-cpp
> required no code changes at all beyond regenerating the thrift structures.
> And as mentioned previously, arrow-rs has never used the field at all.  As
> to performance, removing the field from a 10000 column flat schema saved
> around 2MB out of 11, so a 17% reduction. Parsing time in arrow-rs improved
> only about 3% since the field is simply skipped if encountered anyway, so
> no allocations are saved. I haven't tried benchmarking the other
> implementations.
>
> So I don't think it's going to take years of effort to deprecate that
> field. Of course, the guidelines for forward-incompatible changes [1] will
> need to be followed, so it will take some time for the changes to ripple
> through the ecosystem, but users would have the ability to save a good bit
> of space by turning the unused field off themselves.
>
> If the field is so damaging, I simply don't see why we need to wait any
> longer to remove it. Just because the v3 proposal exists doesn't mean all
> work on the current format needs to halt. Will we forestall work on new
> encodings like ALP until v3 is ready to go? I hope we won't make the perfect
> the enemy of the good here.
>
> Cheers,
> Ed
>
>
> [1]
> https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#additionschanges-to-the-format
>
> On 2026/04/07 12:41:35 王嘉仪 wrote:
> > Hi all,
> >
> > We tested whether various Parquet readers depend on path_in_schema (in
> > ColumnMetaData) to understand the impact of deprecating it.
> >
> > We tested parquet-mr, Databricks, DuckDB, ClickHouse, Snowflake, and
> > Fabric. The result is that parquet-mr and Fabric use path_in_schema as a
> > hard dependency and cannot read files without it. Databricks supports
> > reading files without path_in_schema in newer versions, and the other
> > readers support reading them in their latest versions.
> >
> > That said, deprecating the field is hard in the current Thrift-based
> > Parquet spec and would require years of effort for the ecosystem to adopt
> > the change. A reminder that this field is a list of strings and
> > contributes heavily to footer bloat. We've seen footers as large as 367 MB
> > in production, with over 60% of the size coming from this field alone.
> >
> > On the other hand, the FlatBuffer footer proposal gives us the ability to
> > not only decode the schema in an efficient way but also remove redundant
> > fields completely without breaking compatibility with the existing Thrift
> > footer. It doesn't introduce a breaking change, but also doesn't slow down
> > our ability to evolve the footer. It buys us time to embrace the change
> > across the entire ecosystem.
> >
> > Best, Jiayi
> >
>
