Re: [DISCUSS] INT96 stats

Andrew Lamb Thu, 19 Jun 2025 03:17:49 -0700

> While INT96 is now deprecated, it's still the default timestamp type in
> Spark, resulting in a significant amount of existing data written in this
> format.


I agree with Gang and Antoine that the better solution is to change Spark
to write non-deprecated Parquet data types.

There is an issue in the Spark JIRA to do this [1], but the only feedback
on the associated PR [2] is that it is a breaking change.

If Spark is going to keep writing INT96 timestamps indefinitely, I suggest
we un-deprecate the INT96 timestamps to reflect the ecosystem reality that
they will be here for a while rather than pretending they are really
deprecated.

Andrew

[1]: https://issues.apache.org/jira/browse/SPARK-51359
[2]: https://github.com/apache/spark/pull/50215#issuecomment-2715147840

p.s. as an aside, is anyone from Databricks pushing Spark to change its
default timestamp type? Or will the focus be to improve INT96 timestamps
instead?


On Wed, Jun 18, 2025 at 10:50 PM Gang Wu <ust...@gmail.com> wrote:

> Improving a deprecated feature does not seem to add much value,
> especially when there are abundant Parquet implementations in the wild.
> IIRC, parquet-java is planning to release 1.16.0 for new data types like
> variant and geometry. It is also the last version to support Java 8. All
> deprecated APIs might get removed in 2.0.0, so I'm not sure whether older
> Spark versions will be able to leverage the int96 stats. The right way to
> go is to push forward the adoption of timestamp logical types.
>
> Best,
> Gang
>
> On Thu, Jun 19, 2025 at 12:31 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hi Alkis,
> > Is this the right thread link? It seems to be a discussion on Timestamp
> > Nano support (which IIUC won't use int96, but I'm not sure this covers
> > changing the behavior for existing timestamps, which I think are at either
> > millisecond or microsecond granularity)?
> >
> > > there will be customers that want to interface with legacy systems
> > > with INT96. This is why we decided to do both.
> >
> >
> > It might help to elaborate on the time-frame here. Since it appears
> > reference implementations of parquet are not currently writing statistics,
> > if we merge these changes, when will they be picked up in Spark? Would the
> > plan be to backport the parquet-java changes to older versions of Spark
> > (otherwise the legacy systems wouldn't really make use of or emit stats
> > anyway)? What is the delta between Spark picking up these changes and
> > transitioning off of Int96 by default? Is the expectation that even once
> > the default is changed in Spark to not use int96, there will be a large
> > number of users that will override the default to write int96?
> >
> > Thanks,
> > Micah
> >
> > On Wed, Jun 18, 2025 at 1:35 AM Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.invalid> wrote:
> >
> > > We are also driving that in parallel:
> > > https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of.
> > >
> > > Even when Spark defaults to INT64, there will be old versions of Spark
> > > running, and there will be customers that want to interface with legacy
> > > systems with INT96. This is why we decided to do both.
> > >
> > > On Wed, Jun 18, 2025 at 9:53 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > >
> > > > Can we get Spark to stop emitting INT96? They are not being an
> > > > extremely good community player here.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On Fri, 13 Jun 2025 15:17:51 +0200
> > > > Alkis Evlogimenos
> > > > <alkis.evlogime...@databricks.com.INVALID>
> > > > wrote:
> > > > > Hi folks,
> > > > >
> > > > > > While INT96 is now deprecated, it's still the default timestamp type
> > > > > > in Spark, resulting in a significant amount of existing data written
> > > > > > in this format.
> > > > > >
> > > > > > Historically, parquet-mr/java has not emitted or read statistics for
> > > > > > INT96. This was likely due to the fact that standard byte comparison
> > > > > > on the INT96 representation doesn't align with logical comparisons,
> > > > > > potentially leading to incorrect min/max values. This is unfortunate
> > > > > > because timestamp filters are extremely common and lack of stats
> > > > > > limits optimization opportunities.
> > > > > >
> > > > > > Since its inception, Photon <https://www.databricks.com/product/photon>
> > > > > > has emitted and utilized INT96 statistics by employing a logical
> > > > > > comparator, ensuring their correctness. We have now implemented
> > > > > > <https://github.com/apache/parquet-java/pull/3243> the same support
> > > > > > within parquet-java.
> > > > > >
> > > > > > We'd like to get the community's thoughts on this addition. We
> > > > > > anticipate that most users may not be directly affected due to the
> > > > > > declining use of INT96. However, we are interested in identifying any
> > > > > > potential drawbacks or unforeseen issues with this approach.
> > > > >
> > > > > Cheers
> > > > >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>
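
To make the byte-comparison point from the original message concrete, here
is a minimal Python sketch. It assumes the layout commonly used for INT96
timestamps: 8 little-endian bytes of nanoseconds within the day, followed
by a 4-byte little-endian Julian day. The helper names (encode_int96,
logical_key) and the sample values are hypothetical; this is an
illustration only, not the comparator from the parquet-java PR.

```python
import struct


def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp into 12 bytes (assumed layout):
    8 bytes little-endian nanos-of-day, then 4 bytes little-endian Julian day."""
    return struct.pack("<qi", nanos_of_day, julian_day)


def logical_key(raw: bytes) -> tuple:
    """Sort key for logical ordering: Julian day first, then nanos-of-day."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


# Two arbitrary sample values: `earlier` is one day before `later`,
# but has a larger nanos-of-day field.
earlier = encode_int96(julian_day=2460845, nanos_of_day=2)
later = encode_int96(julian_day=2460846, nanos_of_day=1)

values = [later, earlier]

# Plain byte-wise comparison looks at the low nanos byte first,
# so it picks the wrong minimum for this pair.
bytewise_min = min(values)

# Comparing (day, nanos) gives the chronologically correct minimum.
logical_min = min(values, key=logical_key)

print("byte-wise min is the earlier timestamp:", bytewise_min == earlier)
print("logical min is the earlier timestamp:  ", logical_min == earlier)
```

With these two values the byte-wise minimum is the later timestamp, which
is the kind of incorrect min/max the original message describes; ordering
by (day, nanos) avoids it.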
