Hi all,

We tested whether various Parquet readers depend on path_in_schema (in
ColumnMetaData) to understand the impact of deprecating it.

We tested parquet-mr, Databricks, DuckDB, ClickHouse, Snowflake, and
Fabric. parquet-mr and Fabric treat path_in_schema as a hard dependency
and cannot read files that omit it. Databricks can read such files in
newer versions, and DuckDB, ClickHouse, and Snowflake can read them in
their latest versions.
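
For context on why a reader can work without the field at all: the footer
stores the schema as a depth-first list of SchemaElement entries, each with a
name and num_children, so the path of every leaf column can be rebuilt from
the schema alone. Below is a minimal sketch of that reconstruction. It uses
plain (name, num_children) tuples rather than any real Parquet library, the
function name leaf_paths and the sample schema are made up for illustration,
and I am not claiming this is how any of the tested readers implement it.

```python
from typing import List, Tuple

# (name, num_children) pairs in depth-first order, mirroring the flattened
# SchemaElement list in the footer. num_children == 0 marks a leaf column.
SchemaElement = Tuple[str, int]


def leaf_paths(schema: List[SchemaElement]) -> List[List[str]]:
    """Rebuild the path of every leaf column from the flattened schema."""
    paths: List[List[str]] = []
    pos = 1  # index 0 is the root, whose name is not part of any path

    def walk(prefix: List[str], count: int) -> None:
        nonlocal pos
        for _ in range(count):
            name, num_children = schema[pos]
            pos += 1
            path = prefix + [name]
            if num_children == 0:
                paths.append(path)  # leaf: one column chunk per row group
            else:
                walk(path, num_children)  # group: descend into children

    walk([], schema[0][1])
    return paths


# Hypothetical schema: root { id, address { city, zip } }
example_schema = [
    ("schema", 2),
    ("id", 0),
    ("address", 2),
    ("city", 0),
    ("zip", 0),
]

print(leaf_paths(example_schema))
```

The i-th path returned lines up with the i-th column chunk in each row group,
which is the information path_in_schema repeats in every ColumnMetaData.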

That said, deprecating the field is hard in the current Thrift-based
Parquet spec and would require years of effort for the ecosystem to adopt
the change. As a reminder, this field is a list of strings and contributes
heavily to footer bloat. We've seen footers as large as 367 MB in
production, with over 60% of the size coming from this field alone.
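
To make the bloat concrete, here is a rough back-of-the-envelope estimate of
how many footer bytes path_in_schema costs. It assumes the Thrift compact
protocol (a one-byte field header, a list header, then a varint length plus
the UTF-8 bytes for each string) and that the field appears once per column
chunk, i.e. once per column per row group. The helper names, column paths, and
row-group count are hypothetical and not taken from the production files
mentioned above.

```python
from typing import List


def varint_len(n: int) -> int:
    """Number of bytes needed to encode n as an unsigned varint."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def path_bytes(path: List[str]) -> int:
    """Approximate encoded size of one path_in_schema field."""
    size = 1  # field header, assuming the short (delta) form
    # List header: one byte for fewer than 15 elements, otherwise one byte
    # plus a varint carrying the element count.
    size += 1 if len(path) < 15 else 1 + varint_len(len(path))
    for part in path:
        raw = part.encode("utf-8")
        size += varint_len(len(raw)) + len(raw)
    return size


def footer_path_bytes(paths: List[List[str]], row_groups: int) -> int:
    """Total bytes spent on path_in_schema across all column chunks."""
    return row_groups * sum(path_bytes(p) for p in paths)


# Hypothetical file: three leaf columns, 1000 row groups.
example_paths = [["id"], ["address", "city"], ["address", "zip"]]
print(footer_path_bytes(example_paths, row_groups=1000))
```

The cost grows with columns times row groups times path length, so wide,
deeply nested schemas with many row groups are hit hardest.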

On the other hand, the FlatBuffer footer proposal lets us not only decode
the schema efficiently but also drop redundant fields entirely without
breaking compatibility with the existing Thrift footer. It introduces no
breaking change, yet it does not slow down our ability to evolve the
footer, and it buys us time for the entire ecosystem to adopt the change.

Best, Jiayi
