[DISCUSS] INT96 stats

Alkis Evlogimenos Fri, 13 Jun 2025 06:19:33 -0700

Hi folks,

While INT96 is now deprecated, it's still the default timestamp type in
Spark, so a significant amount of existing data is written in this format.


Historically, parquet-mr/java has not emitted or read statistics for INT96.
This is likely because standard byte comparison on the INT96 representation
doesn't match the logical (chronological) ordering, which can produce
incorrect min/max values. This is unfortunate because timestamp filters are
extremely common, and the lack of stats limits optimization opportunities.

Since its inception, Photon <https://www.databricks.com/product/photon> has
emitted and used INT96 statistics, employing a logical comparator to ensure
their correctness. We have now implemented
<https://github.com/apache/parquet-java/pull/3243> the same support in
parquet-java.
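
To illustrate the mismatch between byte order and logical order, here is a
minimal sketch. It assumes the usual INT96 timestamp layout (8 bytes of
nanoseconds-of-day followed by 4 bytes of Julian day, both little-endian).
The class name, the encode helper, and the LOGICAL comparator are
hypothetical names made up for this example; this is not the code from the
linked PR, which may be structured differently.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.Comparator;

final class Int96Compare {
    // Hypothetical helper: pack (julianDay, nanosOfDay) into the assumed
    // 12-byte layout: nanos-of-day in bytes 0..7, Julian day in bytes 8..11.
    static byte[] encode(int julianDay, long nanosOfDay) {
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)
                .putInt(julianDay)
                .array();
    }

    // Logical comparison: Julian day first, then nanoseconds within the day.
    static final Comparator<byte[]> LOGICAL = (a, b) -> {
        ByteBuffer x = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer y = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN);
        int byDay = Integer.compare(x.getInt(8), y.getInt(8));
        return byDay != 0 ? byDay : Long.compare(x.getLong(0), y.getLong(0));
    };

    public static void main(String[] args) {
        byte[] earlier = encode(2460000, 1_000L); // day N, 1 microsecond in
        byte[] later = encode(2460001, 0L);       // day N + 1, midnight

        // Lexicographic byte comparison looks at the low nanos byte first.
        System.out.println("byte order says earlier < later: "
                + (Arrays.compareUnsigned(earlier, later) < 0));
        System.out.println("logical order says earlier < later: "
                + (LOGICAL.compare(earlier, later) < 0));
    }
}
```

With these two values the raw byte comparison ranks the earlier timestamp
after the later one, because the least significant nanoseconds byte is
compared first, while the logical comparator orders them correctly. Min/max
statistics computed with the byte order would therefore be unsafe for filter
pushdown.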

We'd like to get the community's thoughts on this addition. We anticipate
that most users will not be directly affected, given the declining use of
INT96. However, we are interested in identifying any potential drawbacks or
unforeseen issues with this approach.

Cheers
