Hi folks,

While INT96 is now deprecated, it is still the default timestamp type in Spark, so a significant amount of existing data is written in this format.
Historically, parquet-mr/java has neither emitted nor read statistics for INT96. This is likely because a standard byte comparison on the INT96 representation doesn't align with logical timestamp ordering, which can produce incorrect min/max values. That is unfortunate: timestamp filters are extremely common, and the lack of stats limits optimization opportunities.

Since its inception, Photon <https://www.databricks.com/product/photon> has emitted and used INT96 statistics, employing a logical comparator to ensure they are correct. We have now implemented <https://github.com/apache/parquet-java/pull/3243> the same support in parquet-java.

We'd like to get the community's thoughts on this addition. We anticipate that most users will not be directly affected, given the declining use of INT96. However, we are interested in identifying any potential drawbacks or unforeseen issues with this approach.

Cheers
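
P.S. For anyone who hasn't looked at INT96 recently, here is a minimal sketch of why raw byte comparison and logical comparison can disagree. It assumes the commonly used layout (8 bytes of nanoseconds within the day followed by 4 bytes of Julian day, both little-endian). The class and method names are hypothetical, made up for illustration; this is not the code from the PR.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper for illustration only; not a parquet-java class.
final class Int96Sketch {

    // Assumed layout: 8 bytes nanos-of-day, then 4 bytes Julian day, little-endian.
    static byte[] encode(int julianDay, long nanosOfDay) {
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)
                .putInt(julianDay)
                .array();
    }

    // Logical order: compare the Julian day first, then the nanos within the day.
    static int compareLogical(byte[] a, byte[] b) {
        ByteBuffer x = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer y = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN);
        int byDay = Integer.compare(x.getInt(8), y.getInt(8));
        if (byDay != 0) {
            return byDay;
        }
        return Long.compare(x.getLong(0), y.getLong(0));
    }

    // Plain unsigned lexicographic comparison of the 12 raw bytes.
    static int compareRawBytes(byte[] a, byte[] b) {
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        byte[] earlier = encode(2460000, 5L); // earlier day, larger nanos
        byte[] later = encode(2460001, 1L);   // later day, smaller nanos
        System.out.println("logical:   " + Integer.signum(compareLogical(earlier, later)));
        System.out.println("raw bytes: " + Integer.signum(compareRawBytes(earlier, later)));
    }
}
```

In this example the two orderings disagree: the raw comparison sees the nanos bytes first and ranks the earlier timestamp as larger, which is how byte-wise min/max could end up wrong.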