We are also driving that in parallel: https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of.
Even when Spark defaults to INT64, there will still be old versions of Spark running, and there will be customers who want to interface with legacy systems using INT96. This is why we decided to do both.

On Wed, Jun 18, 2025 at 9:53 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Can we get Spark to stop emitting INT96? They are not being an
> extremely good community player here.
>
> Regards
>
> Antoine.
>
>
> On Fri, 13 Jun 2025 15:17:51 +0200
> Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.INVALID>
> wrote:
> > Hi folks,
> >
> > While INT96 is now deprecated, it's still the default timestamp type in
> > Spark, resulting in a significant amount of existing data written in this
> > format.
> >
> > Historically, parquet-mr/java has not emitted or read statistics for
> > INT96. This was likely because standard byte comparison on the INT96
> > representation doesn't align with logical comparison, potentially
> > leading to incorrect min/max values. This is unfortunate because
> > timestamp filters are extremely common, and the lack of stats limits
> > optimization opportunities.
> >
> > Since its inception, Photon <https://www.databricks.com/product/photon>
> > has emitted and utilized INT96 statistics by employing a logical
> > comparator, ensuring their correctness. We have now implemented
> > <https://github.com/apache/parquet-java/pull/3243> the same support
> > within parquet-java.
> >
> > We'd like to get the community's thoughts on this addition. We
> > anticipate that most users may not be directly affected due to the
> > declining use of INT96. However, we are interested in identifying any
> > potential drawbacks or unforeseen issues with this approach.
> >
> > Cheers
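To illustrate why byte-wise comparison fails for INT96: the 12-byte value stores 8 bytes of nanoseconds-within-day followed by 4 bytes of Julian day number, both little-endian, so lexicographic byte order does not match chronological order. A minimal sketch of a logical comparator (not the actual parquet-java or Photon implementation; the class name is hypothetical) could look like:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Comparator;

// Sketch: compares Parquet INT96 timestamps in logical (chronological)
// order. Layout assumed per the Parquet/Impala convention: bytes 0-7 are
// nanoseconds within the day (little-endian int64), bytes 8-11 are the
// Julian day number (little-endian int32).
public final class Int96Comparator implements Comparator<byte[]> {
    @Override
    public int compare(byte[] a, byte[] b) {
        ByteBuffer ba = ByteBuffer.wrap(a).order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer bb = ByteBuffer.wrap(b).order(ByteOrder.LITTLE_ENDIAN);
        // Compare the Julian day first, then nanos within the day.
        // Unsigned comparison keeps ordering sane for large day values.
        int byDay = Integer.compareUnsigned(ba.getInt(8), bb.getInt(8));
        if (byDay != 0) {
            return byDay;
        }
        return Long.compareUnsigned(ba.getLong(0), bb.getLong(0));
    }
}
```

Because the most significant component (the day) sits in the trailing bytes, a plain `byte[]` lexicographic comparison would order timestamps by nanos-of-day first, which is why min/max statistics computed that way can be wrong.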