Hi Alkis,
Is this the right thread link? It seems to be a discussion on Timestamp
Nano support, which IIUC won't use INT96. I'm not sure it covers changing
the behavior for existing timestamps, which I think are at either
millisecond or microsecond granularity.

> there will be customers that want to interface with legacy systems
> with INT96. This is why we decided to do both.


It might help to elaborate on the time frame here. Since it appears that
reference implementations of Parquet are not currently writing statistics
for INT96, if we merge these changes, when will they be picked up in Spark?
Would the plan be to backport the parquet-java changes to older versions of
Spark (otherwise the legacy systems wouldn't really use or emit stats
anyway)? What is the delta between Spark picking up these changes and
transitioning off of INT96 by default? Is the expectation that even once the
default in Spark is changed to not use INT96, there will be a large number
of users that override the default to write INT96?

Thanks,
Micah

On Wed, Jun 18, 2025 at 1:35 AM Alkis Evlogimenos
<alkis.evlogime...@databricks.com.invalid> wrote:

> We are also driving that in parallel:
> https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of.
>
> Even when Spark defaults to INT64 there will be old versions of Spark
> running, and there will be customers that want to interface with legacy
> systems with INT96. This is why we decided to do both.
>
> On Wed, Jun 18, 2025 at 9:53 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > Can we get Spark to stop emitting INT96? They are not being an
> > extremely good community player here.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > On Fri, 13 Jun 2025 15:17:51 +0200
> > Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.INVALID>
> > wrote:
> > > Hi folks,
> > >
> > > While INT96 is now deprecated, it's still the default timestamp type in
> > > Spark, resulting in a significant amount of existing data written in this
> > > format.
> > >
> > > Historically, parquet-mr/java has not emitted or read statistics for INT96.
> > > This was likely due to the fact that standard byte comparison on the INT96
> > > representation doesn't align with logical comparisons, potentially leading
> > > to incorrect min/max values. This is unfortunate because timestamp filters
> > > are extremely common and lack of stats limits optimization opportunities.
> > >
> > > Since its inception Photon <https://www.databricks.com/product/photon>
> > > emitted and utilized INT96 statistics by employing a logical comparator,
> > > ensuring their correctness. We have now implemented
> > > <https://github.com/apache/parquet-java/pull/3243> the same support within
> > > parquet-java.
> > >
> > > We'd like to get the community's thoughts on this addition. We anticipate
> > > that most users may not be directly affected due to the declining use of
> > > INT96. However, we are interested in identifying any potential drawbacks or
> > > unforeseen issues with this approach.
> > >
> > > Cheers
> > >
> >
> >
> >
> >
>
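
For readers following the thread, the mismatch between byte order and logical order mentioned in the original message can be illustrated with a minimal sketch. It assumes the common INT96 timestamp layout (8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, both little-endian) and takes "standard byte comparison" to mean unsigned lexicographic comparison of the raw bytes. The helper names `encode_int96` and `logical_key` are hypothetical and are not taken from parquet-java or Photon.

```python
import struct

# Hypothetical helpers for illustration only. Assumed layout: 8 bytes of
# nanoseconds within the day, then 4 bytes of Julian day, little-endian.
def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    return struct.pack("<qi", nanos_of_day, julian_day)

def logical_key(raw: bytes) -> tuple:
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    # Order by day first, then by time within the day.
    return (julian_day, nanos_of_day)

earlier = encode_int96(2460840, 2)  # 2 ns after midnight on day 2460840
later = encode_int96(2460841, 1)    # 1 ns after midnight on the next day

values = [earlier, later]
bytewise_min = min(values)                  # unsigned lexicographic byte order
logical_min = min(values, key=logical_key)  # chronological order

print(bytewise_min == later)   # True: byte order picks the later timestamp
print(logical_min == earlier)  # True: logical order picks the earlier one
```

Because the nanoseconds field comes first in the byte layout, a plain byte comparison is dominated by the time of day rather than the date, so a min/max computed that way can exclude rows that actually match a timestamp filter.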
