Hi Jiayi,

This is getting a little off topic, but I did a quick test to see how much would be involved in getting some major implementations to support an optional path_in_schema.

In short, parquet-java required changing a single constructor call to use a setter, and replacing two accesses of path_in_schema with an already available array of paths from the schema metadata. arrow-cpp required no code changes at all beyond regenerating the thrift structures. And as mentioned previously, arrow-rs has never used the field at all.

As to performance, removing the field from a 10000 column flat schema saved around 2 MB out of 11 MB, roughly a 17% reduction. Parsing time in arrow-rs improved by only about 3%, because the field is simply skipped when encountered, so no allocations are saved. I haven't tried benchmarking the other implementations.

So I don't think it's going to take years of effort to deprecate that field. Of course, the guidelines for forward-incompatible changes [1] will need to be followed, so it will take some time for the changes to ripple through the ecosystem, but users would have the ability to save a good bit of space by turning the unused field off themselves.

If the field is so damaging, I simply don't see why we need to wait any longer to remove it. Just because the v3 proposal exists doesn't mean all work on the current format needs to halt. Will we forestall work on new encodings like ALP until v3 is ready to go? I hope we won't make the perfect the enemy of the good here.

Cheers,
Ed

[1] https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#additionschanges-to-the-format

On 2026/04/07 12:41:35 王嘉仪 wrote:
> Hi all,
>
> We tested whether various Parquet readers depend on path_in_schema (in
> ColumnMetaData) to understand the impact of deprecating it.
>
> We tested parquet-mr, Databricks, DuckDB, ClickHouse, Snowflake, and
> Fabric. The result is that parquet-mr and Fabric use path_in_schema as a
> hard dependency and cannot read files without it. Databricks supports
> reading files without path_in_schema in newer versions, and the other
> readers support reading them in their latest versions.
>
> That said, deprecating the field is hard in the current Thrift-based
> Parquet spec and would require years of effort for the ecosystem to adopt
> the change. A reminder that this field is a list of strings and contributes
> heavily to footer bloat. We've seen footers as large as 367 MB in
> production, with over 60% of the size coming from this field alone.
>
> On the other hand, the FlatBuffer footer proposal gives us the ability to
> not only decode the schema in an efficient way but also remove redundant
> fields completely without breaking compatibility with the existing Thrift
> footer. It doesn't introduce a breaking change, but also doesn't slow down
> our ability to evolve the footer. It buys us time to embrace the change
> across the entire ecosystem.
>
> Best, Jiayi
>
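
For anyone wondering why a reader can do without path_in_schema: the leaf column paths can be rebuilt from the flattened schema list in the footer, which is stored depth-first with a child count on each group. The following is a minimal Python sketch of that idea, not code from parquet-java or any other implementation. The SchemaElement class here is a simplified stand-in carrying only name and num_children, the leaf_paths helper and the example schema are hypothetical, and the sketch assumes that a group with zero children can be treated like a leaf.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SchemaElement:
    """Simplified stand-in for a footer schema entry (illustration only)."""
    name: str
    num_children: Optional[int] = None  # None (or 0) is treated as a leaf here


def leaf_paths(schema: List[SchemaElement]) -> List[List[str]]:
    """Rebuild the path of every leaf column from a depth-first schema list.

    schema[0] is the root; its name is not part of any column path.
    """
    paths: List[List[str]] = []
    pos = 1  # index of the next unread element, starting just after the root

    def walk(prefix: List[str], count: int) -> None:
        nonlocal pos
        for _ in range(count):
            elem = schema[pos]
            pos += 1
            path = prefix + [elem.name]
            if elem.num_children:
                # Group node: its children follow immediately in the list.
                walk(path, elem.num_children)
            else:
                # Leaf node: this is one column's path.
                paths.append(path)

    walk([], schema[0].num_children or 0)
    return paths


# Hypothetical schema: root { id, point { x, y }, name }
example = [
    SchemaElement("schema", 3),
    SchemaElement("id"),
    SchemaElement("point", 2),
    SchemaElement("x"),
    SchemaElement("y"),
    SchemaElement("name"),
]

for path in leaf_paths(example):
    print(".".join(path))
```

The paths come out in the same depth-first order as the leaves appear in the schema, so the i-th path would line up with the i-th column chunk of a row group, assuming chunks are written in schema order.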
