Re: [DISCUSS] INT96 stats

Gang Wu Wed, 18 Jun 2025 19:50:32 -0700

Improving a deprecated feature does not seem to add much value, especially
when there are abundant Parquet implementations in the wild. IIRC,
parquet-java is planning to release 1.16.0 for new data types like variant
and geometry. It is also the last version to support Java 8. All deprecated
APIs might get removed in 2.0.0, so I'm not sure whether older Spark
versions would be able to leverage the int96 stats. The right way forward
is to push the adoption of timestamp logical types.


Best,
Gang

On Thu, Jun 19, 2025 at 12:31 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Hi Alkis,
> Is this the right thread link?  It seems to be a discussion on Timestamp
> Nano support (which IIUC won't use int96, but I'm not sure this covers
> changing the behavior for existing timestamps, which I think are at either
> millisecond or microsecond granularity)?
>
> > there will be customers that want to interface with legacy systems
> > with INT96. This is why we decided to do both.
>
>
> It might help to elaborate on the time-frame here.  Since it appears
> reference implementations of parquet are not currently writing statistics,
> if we merge these changes, when will they be picked up in Spark? Would the
> plan be to backport the parquet-java changes to older versions of Spark
> (otherwise the legacy systems wouldn't really make use of or emit stats
> anyway)?  What is the delta between Spark picking up these changes and
> transitioning off of Int96 by default?  Is the expectation that even once
> the default is changed in Spark to not use int96, there will be a large
> number of users that will override the default to write int96?
>
> Thanks,
> Micah
>
> On Wed, Jun 18, 2025 at 1:35 AM Alkis Evlogimenos
> <alkis.evlogime...@databricks.com.invalid> wrote:
>
> > We are also driving that in parallel:
> > https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of.
> >
> > Even when Spark defaults to INT64 there will be old versions of Spark
> > running, and there will be customers that want to interface with legacy
> > systems with INT96. This is why we decided to do both.
> >
> > On Wed, Jun 18, 2025 at 9:53 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > >
> > > Can we get Spark to stop emitting INT96? They are not being an
> > > extremely good community player here.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Fri, 13 Jun 2025 15:17:51 +0200
> > > Alkis Evlogimenos
> > > <alkis.evlogime...@databricks.com.INVALID>
> > > wrote:
> > > > Hi folks,
> > > >
> > > > While INT96 is now deprecated, it's still the default timestamp type
> > > > in Spark, resulting in a significant amount of existing data written
> > > > in this format.
> > > >
> > > > Historically, parquet-mr/java has not emitted or read statistics for
> > > > INT96. This was likely due to the fact that standard byte comparison
> > > > on the INT96 representation doesn't align with logical comparisons,
> > > > potentially leading to incorrect min/max values. This is unfortunate
> > > > because timestamp filters are extremely common and lack of stats
> > > > limits optimization opportunities.
> > > >
> > > > Since its inception Photon
> > > > <https://www.databricks.com/product/photon> emitted and utilized
> > > > INT96 statistics by employing a logical comparator, ensuring their
> > > > correctness. We have now implemented
> > > > <https://github.com/apache/parquet-java/pull/3243> the same support
> > > > within parquet-java.
> > > >
> > > > We'd like to get the community's thoughts on this addition. We
> > > > anticipate that most users may not be directly affected due to the
> > > > declining use of INT96. However, we are interested in identifying any
> > > > potential drawbacks or unforeseen issues with this approach.
> > > >
> > > > Cheers
> > > >
> > >
> > >
> > >
> > >
> >
>
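
For readers following the byte-comparison point raised in the original
message, here is a minimal Python sketch of why raw byte ordering on INT96
timestamps can produce wrong min/max values and how a logical comparator
avoids that. It assumes the commonly used INT96 timestamp layout (8 bytes of
little-endian nanoseconds within the day, followed by 4 bytes of
little-endian Julian day). The helper names encode_int96 and logical_key are
hypothetical and exist only for this illustration; this is not the
parquet-java or Photon code.

```python
import struct
from datetime import datetime, timezone

# Julian day number of 1970-01-01, used to convert Unix days to Julian days.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_int96(ts: datetime) -> bytes:
    """Encode a UTC datetime as 12 bytes (assumed layout, see lead-in):
    int64 little-endian nanos of day, then int32 little-endian Julian day."""
    delta = ts - UNIX_EPOCH
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1_000
    julian_day = delta.days + JULIAN_DAY_OF_UNIX_EPOCH
    return struct.pack("<qi", nanos_of_day, julian_day)


def logical_key(raw: bytes) -> tuple:
    """Sort key for a logical comparison: the day first, then nanos of day."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


# 'early' is one second past midnight; 'late' is midnight of the next day.
early = encode_int96(datetime(2025, 6, 13, 0, 0, 1, tzinfo=timezone.utc))
late = encode_int96(datetime(2025, 6, 14, 0, 0, 0, tzinfo=timezone.utc))
values = [early, late]

# Plain byte ordering looks at the low-order nanos bytes first, so it picks
# the later timestamp as the minimum.
bytewise_min = min(values)

# The logical ordering decodes the value and compares (day, nanos).
logical_min = min(values, key=logical_key)

print("bytewise min is the early timestamp:", bytewise_min == early)
print("logical min is the early timestamp: ", logical_min == early)
```

In this example the byte-wise minimum is the later timestamp, which is the
kind of incorrect min/max the original message describes, while the decoded
(day, nanos) ordering selects the earlier one.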
