Hi all,

We tested whether various Parquet readers depend on path_in_schema (in ColumnMetaData) to understand the impact of deprecating it. The readers we covered were parquet-mr, Databricks, DuckDB, ClickHouse, Snowflake, and Fabric.

The result: parquet-mr and Fabric use path_in_schema as a hard dependency and cannot read files without it. Databricks supports reading files without path_in_schema in newer versions, and the other readers (DuckDB, ClickHouse, Snowflake) support reading them in their latest versions.

That said, deprecating the field is hard in the current Thrift-based Parquet spec, and it would take years of effort for the ecosystem to adopt the change.

As a reminder, this field is a list of strings and contributes heavily to footer bloat. We've seen footers as large as 367 MB in production, with over 60% of the size coming from this field alone.

On the other hand, the FlatBuffer footer proposal lets us not only decode the schema efficiently but also remove redundant fields completely, without breaking compatibility with the existing Thrift footer. It doesn't introduce a breaking change, and it doesn't slow down our ability to evolve the footer. It buys us time for the entire ecosystem to adopt the change.

Best,
Jiayi
