Hi all,

I would like to follow up on parquet-java issue #3215 and the associated PR.

Issue link: https://github.com/apache/parquet-java/issues/3215
PR link: https://github.com/apache/parquet-java/pull/3216

Currently, the list of column encodings written to the file footer is built from a HashSet. Since its iteration order is not deterministic across JVM processes, identical data written with the same parquet-java version and configuration can still produce binary-different files.

Our goal is to enable binary-identical Parquet files when input data and configuration are the same. To achieve this, the encoding list should be sorted before being written (e.g., by enum ordinal) to ensure deterministic ordering. The PR implements this in ParquetMetadataConverter::toFormatEncodings.

We have been running this change in production for over a year without issues, as deterministic binary output is important for our use case. We would prefer to avoid maintaining a downstream patch and instead contribute this upstream if acceptable.

Please let me know if there are concerns with this approach or if anything is missing to move the PR forward.

Best regards,
Mikael Brännström
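
P.S. For readers who have not opened the PR, the idea can be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual PR code: the Encoding enum and the toSortedList helper are hypothetical stand-ins, and the real change lives in ParquetMetadataConverter::toFormatEncodings. The underlying reason for the varying order is that Enum.hashCode() is identity-based, so the bucket layout of a HashSet of enum values can differ from one JVM process to the next.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SortedEncodingsSketch {
    // Hypothetical stand-in for the Parquet encoding enum; not the real class.
    enum Encoding { PLAIN, RLE, BIT_PACKED, PLAIN_DICTIONARY, RLE_DICTIONARY }

    // Copy the set into a list and sort by enum ordinal, so the resulting
    // order no longer depends on HashSet iteration order.
    static List<Encoding> toSortedList(Set<Encoding> encodings) {
        List<Encoding> sorted = new ArrayList<>(encodings);
        sorted.sort(Comparator.comparingInt(Encoding::ordinal));
        return sorted;
    }

    public static void main(String[] args) {
        Set<Encoding> encodings = new HashSet<>();
        encodings.add(Encoding.RLE_DICTIONARY);
        encodings.add(Encoding.PLAIN);
        encodings.add(Encoding.RLE);
        // Prints the encodings in declaration order regardless of insertion order.
        System.out.println(toSortedList(encodings));
    }
}
```

Sorting by ordinal ties the order to the enum declaration, so it stays stable as long as the enum itself does not get reordered.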
