Re: [DISCUSS] INT96 stats

Ed Seidl Fri, 20 Jun 2025 15:57:52 -0700

If we are going to standardize an ordering for INT96, rather than parsing 
"created_by" fields, wouldn't it make more sense to add a new ColumnOrder value 
(like what's proposed for PARQUET-2249 [1])? Then we don't need to maintain a 
list of known good writers.


Ed

[1] https://github.com/apache/parquet-format/pull/221
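
To make the contrast concrete, here is a rough sketch of the two ways a
reader could decide whether INT96 min/max statistics are safe to use. It is
an illustration only. TYPE_ORDER mirrors the existing member of the
ColumnOrder union in parquet.thrift; INT96_TIMESTAMP_ORDER, the writer
allowlist, and both helper functions are hypothetical names that do not
exist in the format or in any implementation.

```python
from enum import Enum


class ColumnOrder(Enum):
    TYPE_ORDER = 1             # mirrors the existing ColumnOrder union member
    INT96_TIMESTAMP_ORDER = 2  # hypothetical new value, not in the spec


# Hypothetical allowlist; every reader would have to keep its own copy current.
KNOWN_GOOD_WRITERS = ("example-writer",)


def trust_stats_via_created_by(created_by: str) -> bool:
    """Approach 1: parse the free-form created_by string from the footer."""
    name = created_by.strip().lower()
    return any(name.startswith(writer) for writer in KNOWN_GOOD_WRITERS)


def trust_stats_via_column_order(order: ColumnOrder) -> bool:
    """Approach 2: the file itself declares which ordering the stats use."""
    return order is ColumnOrder.INT96_TIMESTAMP_ORDER
```

With the second approach the decision depends only on metadata carried in
the file, so a new writer needs no reader-side change to be trusted.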

On 2025/06/19 10:15:13 Andrew Lamb wrote:
> > While INT96 is now deprecated, it's still the default timestamp type in
> > Spark, resulting in a significant amount of existing data written in this
> > format.
> 
> I agree with Gang and Antoine that the better solution is to change Spark
> to write non-deprecated Parquet data types.
> 
> It seems there is an issue in the Spark JIRA to do this [1], but the only
> feedback on the associated PR [2] is that it is a breaking change.
> 
> If Spark is going to keep writing INT96 timestamps indefinitely, I suggest
> we un-deprecate the INT96 timestamps to reflect the ecosystem reality that
> they will be here for a while rather than pretending they are really
> deprecated.
> 
> Andrew
> 
> [1]: https://issues.apache.org/jira/browse/SPARK-51359
> [2]: https://github.com/apache/spark/pull/50215#issuecomment-2715147840
> 
> P.S. As an aside, is anyone from Databricks pushing Spark to change its
> default timestamp type? Or will the focus be on improving INT96 timestamps
> instead?
> 
> 
> On Wed, Jun 18, 2025 at 10:50 PM Gang Wu <ust...@gmail.com> wrote:
> 
> > It does not seem to add much value to improve a deprecated feature,
> > especially when there are abundant Parquet implementations in the wild.
> > IIRC, parquet-java is planning to release 1.16.0 for new data types like
> > variant and geometry. It is also the last version to support Java 8. All
> > deprecated APIs might get removed in 2.0.0, so I'm not sure if older
> > Spark versions will be able to leverage the int96 stats. The right way to
> > go is to push forward the adoption of timestamp logical types.
> >
> > Best,
> > Gang
> >
> > On Thu, Jun 19, 2025 at 12:31 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Hi Alkis,
> > > Is this the right thread link?  It seems to be a discussion on Timestamp
> > > Nano support (which IIUC won't use int96, but I'm not sure this covers
> > > changing the behavior for existing timestamps, which I think are at
> > > either millisecond or microsecond granularity)?
> > >
> > > > there will be customers that want to interface with legacy systems
> > > > with INT96. This is why we decided in doing both.
> > >
> > >
> > > It might help to elaborate on the time-frame here.  Since it appears
> > > reference implementations of Parquet are not currently writing
> > > statistics, if we merge these changes, when will they be picked up in
> > > Spark? Would the plan be to backport the parquet-java changes to older
> > > versions of Spark (otherwise the legacy systems wouldn't really make use
> > > of or emit stats anyway)?  What is the delta between Spark picking up
> > > these changes and transitioning off of INT96 by default?  Is the
> > > expectation that even once the default is changed in Spark to not use
> > > INT96, there will be a large number of users that will override the
> > > default to write INT96?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Wed, Jun 18, 2025 at 1:35 AM Alkis Evlogimenos
> > > <alkis.evlogime...@databricks.com.invalid> wrote:
> > >
> > > > We are also driving that in parallel:
> > > > https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of.
> > > >
> > > > Even when Spark defaults to INT64, there will be old versions of Spark
> > > > running, and there will be customers that want to interface with
> > > > legacy systems that use INT96. This is why we decided to do both.
> > > >
> > > > On Wed, Jun 18, 2025 at 9:53 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > > >
> > > > >
> > > > > Can we get Spark to stop emitting INT96? They are not being an
> > > > > extremely good community player here.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > On Fri, 13 Jun 2025 15:17:51 +0200
> > > > > Alkis Evlogimenos
> > > > > <alkis.evlogime...@databricks.com.INVALID>
> > > > > wrote:
> > > > > > Hi folks,
> > > > > >
> > > > > > > While INT96 is now deprecated, it's still the default
> > > > > > > timestamp type in Spark, resulting in a significant amount
> > > > > > > of existing data written in this format.
> > > > > >
> > > > > > > Historically, parquet-mr/java has not emitted or read
> > > > > > > statistics for INT96. This was likely due to the fact that
> > > > > > > standard byte comparison on the INT96 representation doesn't
> > > > > > > align with logical comparisons, potentially leading to
> > > > > > > incorrect min/max values. This is unfortunate because
> > > > > > > timestamp filters are extremely common and lack of stats
> > > > > > > limits optimization opportunities.
> > > > > >
> > > > > > > Since its inception, Photon
> > > > > > > <https://www.databricks.com/product/photon> has emitted and
> > > > > > > utilized INT96 statistics by employing a logical comparator,
> > > > > > > ensuring their correctness. We have now implemented
> > > > > > > <https://github.com/apache/parquet-java/pull/3243> the same
> > > > > > > support within parquet-java.
> > > > > >
> > > > > > > We'd like to get the community's thoughts on this addition.
> > > > > > > We anticipate that most users may not be directly affected
> > > > > > > due to the declining use of INT96. However, we are
> > > > > > > interested in identifying any potential drawbacks or
> > > > > > > unforeseen issues with this approach.
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
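
The byte-comparison problem raised in the original message at the bottom of
this thread can be shown with a small sketch. It assumes the commonly used
INT96 timestamp layout (8 bytes of nanoseconds within the day followed by 4
bytes of Julian day number, both little-endian). The function names are
made up for illustration, and this is not the parquet-java or Photon
implementation.

```python
import struct


def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp into 12 bytes: LE int64 nanos, then LE int32 day."""
    return struct.pack("<qi", nanos_of_day, julian_day)


def int96_logical_key(raw: bytes) -> tuple:
    """Sort key that orders by Julian day first, then by nanos within the day."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


def int96_min_max(values):
    """Compute min/max statistics using the logical ordering."""
    return (min(values, key=int96_logical_key),
            max(values, key=int96_logical_key))


# Two timestamps one day apart; the earlier one has the larger nanos field.
earlier = encode_int96(julian_day=2460000, nanos_of_day=2)
later = encode_int96(julian_day=2460001, nanos_of_day=1)

# Plain byte comparison looks at the low nanos byte first and gets it wrong.
bytewise_min = min([earlier, later])
logical_min, logical_max = int96_min_max([earlier, later])
```

Here bytewise_min is the later timestamp, while logical_min is the earlier
one, which is why min/max statistics computed with a raw byte comparator
cannot be trusted for filtering.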
