Hi all,

I would like to follow up on parquet-java issue #3215 and the associated PR.

Issue link: https://github.com/apache/parquet-java/issues/3215
PR link: https://github.com/apache/parquet-java/pull/3216

Currently, the list of column encodings written to the file footer is built 
from a HashSet. HashSet iteration order depends on element hash codes, and 
for enum values these are identity-based and can differ between JVM processes. 
As a result, identical data written with the same parquet-java version and 
configuration can still produce files that differ at the byte level.

Our goal is to enable binary-identical Parquet files when input data and 
configuration are the same. To achieve this, the encoding list should be sorted 
before being written (e.g., by enum ordinal) to ensure deterministic ordering. 
The PR implements this in ParquetMetadataConverter::toFormatEncodings.
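
To illustrate the idea, here is a minimal, self-contained sketch. It is not 
the code from the PR: the Encoding enum below is a hypothetical stand-in with 
only a few values, and toSortedList is an illustrative helper name. It only 
shows that copying the set into a list and sorting by ordinal gives the same 
order regardless of how the HashSet happens to iterate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DeterministicEncodings {
    // Hypothetical stand-in for the real encoding enum (assumption for this sketch).
    enum Encoding { PLAIN, RLE, BIT_PACKED, PLAIN_DICTIONARY }

    // Copy the set into a list and sort by enum ordinal, so the result
    // does not depend on the set's iteration order.
    static List<Encoding> toSortedList(Set<Encoding> encodings) {
        List<Encoding> sorted = new ArrayList<>(encodings);
        sorted.sort(Comparator.comparingInt(Encoding::ordinal));
        return sorted;
    }

    public static void main(String[] args) {
        Set<Encoding> encodings = new HashSet<>();
        encodings.add(Encoding.RLE);
        encodings.add(Encoding.PLAIN_DICTIONARY);
        encodings.add(Encoding.PLAIN);

        // Prints [PLAIN, RLE, PLAIN_DICTIONARY] on every run.
        System.out.println(toSortedList(encodings));
    }
}
```

An EnumSet would also iterate in ordinal order, but sorting at the point of 
serialization keeps the change local to the converter.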

We have been running this change in production for over a year without issues, 
as deterministic binary output is important for our use case. We would prefer 
to avoid maintaining a downstream patch and instead contribute this upstream if 
acceptable.

Please let me know if there are concerns with this approach or if anything is 
missing to move the PR forward.

Best regards,
Mikael Brännström


