Re: [DISCUSS] INT96 stats

Ed Seidl Mon, 08 Jun 2026 11:37:27 -0700

+1 for a new ColumnOrder. It's far preferable to parsing created_by strings. I
can provide a Rust PoC once the parquet-format PR is live.


Ed

On 2026/06/08 17:37:50 Divjot Arora via dev wrote:
> Hi folks,
> 
> After more discussion on the right approach to signal validity of
> statistics for int96 columns, we've decided to implement option 3 mentioned
> here <https://lists.apache.org/thread/6t9fr6v602zwt0tw22bqwg81f1ny9ncj>:
> "Formalize ordering as now defined using the timestamp ordering and define
> a new SortOrder required for writers/readers to use stats". This provides
> the strongest guarantee for readers to ensure that the stats are valid,
> which we feel is important given the risk of reading/using incorrect stats.
> Please let me know if anyone has concerns or objections to this approach. I
> will start drafting the parquet-format change and parquet-java
> implementation in parallel.
> 
> -- Divjot
> 
> On Thu, Jun 4, 2026 at 10:14 PM Ryan Blue <[email protected]> wrote:
> 
> > I think that we need to add a sort order so that writers can signal that
> > they produced INT96 stats with timestamp ordering. We recently added a new
> > sort order for float and double to signal basically the same thing and I
> > don't see why we would not do the same thing here.
> >
> > On Wed, Jun 3, 2026 at 5:48 AM Rahul Sharma via dev <
> > [email protected]>
> > wrote:
> >
> > > Hi all,
> > >
> > > Reviving this thread. I'd like to land Option 1 from Micah's summary
> > (keep
> > > INT96 ordering undefined, allow-list on readers) in parquet-java and I
> > have
> > > an open PR for this: https://github.com/apache/parquet-java/pull/3590.
> > > If there are any objections, let's discuss them in this thread or in the
> > > PR.
> > >
> > > Thanks,
> > > Rahul
> > >
> > >
> > > On Mon, Aug 4, 2025 at 7:03 AM Micah Kornfield <[email protected]>
> > > wrote:
> > >
> > > > > Gang Wu wrote (Thu, Jul 24, 1:19 AM):
> > > >
> > > > > For 1 and 2, do we need to maintain an allow-list for known writer
> > > > > implementations
> > > > > as well as their versions officially? My feeling is no. Perhaps it is
> > > the
> > > > > responsibility
> > > > > of interesting implementations to maintain it internally because many
> > > > > projects may
> > > > > not even care about INT96 stats.
> > > >
> > > >
> > > > I think it would be unofficial as it is not part of the spec.
> > Including
> > > it
> > > > on the compatibility matrix might be helpful.
> > > >
> > > >
> > > > I prefer solutions that don't require an allow list to use INT96
> > stats. I
> > > > > don't agree that we could just let implementations handle the allow
> > > > lists.
> > > > > Whatever Parquet Java implements will be copied by other people and
> > we
> > > > will
> > > > > effectively have an allow list that is not well documented.
> > > >
> > > >
> > > > I think I'm OK with this.  Documenting compatibility can be done via
> > the
> > > > compatibility matrix for those implementations that care about this.
> > > >
> > > >
> > > > >
> > > >
> > > >  For 3, I think it is a bug of implementations who fail on new column
> > > > order.
> > > > > If we want
> > > > > to move forward [1] by adding a new column order for IEEE754 total
> > > order,
> > > > > this bug
> > > > > should be fixed anyway.
> > > >
> > > >
> > > > I agree this would need to be fixed on the rust side for IEEE754, but
> > > that
> > > > is a separate concern.  I personally don't think breaking potential old
> > > > readers for a deprecated type, that will hopefully stop being written
> > to
> > > a
> > > > large extent in ~1 year time, is worth the engineering effort here.
> > > > Especially, if Spark moves away from Int96 as default, there would
> > > probably
> > > > be very few new files written with the sort order.  The real question
> > > then
> > > > becomes whether we want to allow efficient pruning for existing files
> > > that
> > > > are in a known good state.
> > > >
> > > > I'd really rather leave this up to implementation maintainers who are
> > > open
> > > > to accepting PR's to allow listing specific implementations if they
> > feel
> > > it
> > > > is worthwhile.
> > > >
> > > >
> > > >
> > > >
> > > > On Thu, Jul 24, 2025 at 12:04 PM Ryan Blue <[email protected]> wrote:
> > > >
> > > > > I prefer solutions that don't require an allow list to use INT96
> > > stats. I
> > > > > don't agree that we could just let implementations handle the allow
> > > > lists.
> > > > > Whatever Parquet Java implements will be copied by other people and
> > we
> > > > will
> > > > > effectively have an allow list that is not well documented. I think
> > > that
> > > > we
> > > > > need to solve this so the requirements are understood (how to sort
> > > > values)
> > > > > and so that implementations can signal that a file was written with
> > > stats
> > > > > that fit those requirements, without allow lists.
> > > > >
> > > > > On Thu, Jul 24, 2025 at 9:09 AM Alkis Evlogimenos
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > My preference would be 1, 3, 2 in that order. Not super strong
> > > opinion
> > > > > > though, my take is that any of them works for the near term until
> > the
> > > > > type
> > > > > > dies off.
> > > > > >
> > > > > > On Thu, Jul 24, 2025 at 6:46 PM Ed Seidl <[email protected]>
> > wrote:
> > > > > >
> > > > > > > If INT96 is to remain deprecated, I'd prefer 1. If we want a
> > > defined
> > > > > > > ordering for INT96 I'd prefer 3 to maintaining a "known good"
> > list.
> > > > > > >
> > > > > > > As to the forward compatibility issue with rust, that's already
> > an
> > > > > issue
> > > > > > > with logical types (and any other unions in the spec). We're
> > > > currently
> > > > > > > trying to work that [1].
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Ed
> > > > > > >
> > > > > > > [1] https://github.com/apache/arrow-rs/issues/7909
> > > > > > >
> > > > > > > On 2025/07/24 08:19:13 Gang Wu wrote:
> > > > > > > > For 1 and 2, do we need to maintain an allow-list for known
> > > writer
> > > > > > > > implementations
> > > > > > > > as well as their versions officially? My feeling is no. Perhaps
> > > it
> > > > is
> > > > > > the
> > > > > > > > responsibility
> > > > > > > > of interesting implementations to maintain it internally
> > because
> > > > many
> > > > > > > > projects may
> > > > > > > > not even care about INT96 stats.
> > > > > > > >
> > > > > > > > For 3, I think it is a bug of implementations who fail on new
> > > > column
> > > > > > > order.
> > > > > > > > If we want
> > > > > > > > to move forward [1] by adding a new column order for IEEE754
> > > total
> > > > > > order,
> > > > > > > > this bug
> > > > > > > > should be fixed anyway.
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/parquet-format/pull/221
> > > > > > > >
> > > > > > > > On Thu, Jul 24, 2025 at 1:30 AM Micah Kornfield <
> > > > > [email protected]
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Just to follow up on this, I think the last issues remaining
> > > are
> > > > > > > updating
> > > > > > > > > the spec.
> > > > > > > > >
> > > > > > > > > There is already a draft PR (
> > > > > > > > > https://github.com/apache/parquet-format/pull/504) for
> > > updating
> > > > > the
> > > > > > > spec.
> > > > > > > > >
> > > > > > > > > I think there are three main options:
> > > > > > > > > 1.  Keep ordering for int96 undefined with an implementation
> > > note
> > > > > > (the
> > > > > > > > > current PR does this).
> > > > > > > > > 2.  Formalize ordering as now defined using the timestamp
> > > > ordering.
> > > > > > > > > 3.  Formalize ordering as now defined using the timestamp
> > > > ordering
> > > > > > and
> > > > > > > > > define a new SortOrder required for writers/readers to use
> > > stats.
> > > > > > > > >
> > > > > > > > > The main trade-offs are for options 1 and 2, we potentially
> > > need
> > > > to
> > > > > > > allow
> > > > > > > > > list implementations that are known to produce valid stats
> > > (e.g.
> > > > > > older
> > > > > > > > > versions of Rust were writing stats that didn't conform to
> > > > > Timestamp
> > > > > > > > > ordering).
> > > > > > > > >
> > > > > > > > > For item #3, the main issue is that not all readers might be
> > > > > forward
> > > > > > > > > compatible for a new sort order.  In particular Rust readers
> > > > would
> > > > > > > break on
> > > > > > > > > any new files [1].
> > > > > > > > >
> > > > > > > > > Given this I suggest we move forward with the currently
> > opened
> > > PR
> > > > > and
> > > > > > > not
> > > > > > > > > > officially formalize this in the spec.  Implementations will
> > > need
> > > > to
> > > > > > > > > allow-list for known good writers.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Micah
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > [1] https://github.com/apache/arrow-rs/issues/7909
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, Jun 30, 2025 at 8:55 AM Alkis Evlogimenos
> > > > > > > > > <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > I also checked internally with the Spark OSS team and the
> > > plan
> > > > > for
> > > > > > > having
> > > > > > > > > > INT64 timestamps in Spark by default is to make the change
> > > when
> > > > > > > Delta v5
> > > > > > > > > > and Iceberg v4 are proposed. This is expected to happen
> > > around
> > > > > the
> > > > > > > first
> > > > > > > > > > half of 2026.
> > > > > > > > > >
> > > > > > > > > > On Wed, Jun 25, 2025 at 8:41 PM Andrew Lamb <
> > > > > > [email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > We had a good discussion about this at the sync today.
> > > Here
> > > > is
> > > > > > my
> > > > > > > > > > summary
> > > > > > > > > > >
> > > > > > > > > > > * Pedantically, according to the current spec[1] there is
> > > no
> > > > > > > defined
> > > > > > > > > > > ordering for Int96 types and thus arrow-rs can not be
> > > writing
> > > > > > > > > "incorrect"
> > > > > > > > > > > values (as there is no definition of correct)
> > > > > > > > > > > * Practically speaking, arrow-rs is writing something
> > > > different
> > > > > > > than
> > > > > > > > > > Photon
> > > > > > > > > > > (Databricks proprietary spark engine)
> > > > > > > > > > > * What Photon is doing arguably makes more sense (to use
> > > the
> > > > > > > ordering
> > > > > > > > > of
> > > > > > > > > > > the only logical type to use Int96)
> > > > > > > > > > > * GH-7686: [Parquet] Fix int96 min/max stats #7687[2]
> > > brings
> > > > > > > arrow-rs
> > > > > > > > > > into
> > > > > > > > > > > line with Photon which makes sense to me
> > > > > > > > > > >
> > > > > > > > > > > Rahul has also filed a ticket in parquet-format to
> > discuss
> > > > > > > formalizing
> > > > > > > > > > the
> > > > > > > > > > > ordering of Int96 statistics[3]
> > > > > > > > > > >
> > > > > > > > > > > In the interim, I filed a PR[4] in the parquet-format
> > repo
> > > to
> > > > > at
> > > > > > > least
> > > > > > > > > > try
> > > > > > > > > > > and clarify the intent of the changes to arrow-rs and
> > > > > > parquet-java
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Andrew
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > [1]: https://github.com/apache/parquet-format/blob/cf943c197f4fad826b14ba0c40eb0ffdab585285/src/main/thrift/parquet.thrift#L1079
> > > > > > > > > > > [2]: https://github.com/apache/arrow-rs/pull/7687
> > > > > > > > > > > [3]: https://github.com/apache/parquet-format/issues/502
> > > > > > > > > > > [4]: https://github.com/apache/parquet-format/pull/504
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jun 25, 2025 at 10:52 AM Rahul Sharma
> > > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > > I have prepared a doc
> > > > > > > > > > > > > <https://docs.google.com/document/d/1Ox0qHYBgs_3-pNqn9V8zVQm_W6qP0lsbd2XwQnQVz1Y/edit?tab=t.0>
> > > > > > > > > > > > to summarize and have all the relevant links in one
> > > place.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jun 25, 2025 at 1:32 PM Alkis Evlogimenos
> > > > > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Spark needs to start writing INT64 nanos first to be
> > > able
> > > > > to
> > > > > > > > > replace
> > > > > > > > > > > > INT96
> > > > > > > > > > > > > which is in nanos if data is at nano granularity.
> > This
> > > is
> > > > > > why I
> > > > > > > > > > linked
> > > > > > > > > > > > that
> > > > > > > > > > > > > ticket which is a prerequisite to switching to INT64
> > in
> > > > > many
> > > > > > > cases.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I understand the concerns around changing a
> > deprecated
> > > > > aspect
> > > > > > > of
> > > > > > > > > the
> > > > > > > > > > > > > parquet spec. The reason we decided to bring this
> > > forward
> > > > > is
> > > > > > > > > because:
> > > > > > > > > > > > > 1. there are a lot of parquet files with the right
> > > INT96
> > > > > > stats
> > > > > > > > > > > out there
> > > > > > > > > > > > > (Photon has been writing them for years)
> > > > > > > > > > > > > 2. all engines ignore the INT96 stats so Photon
> > writing
> > > > > them
> > > > > > > didn't
> > > > > > > > > > > break
> > > > > > > > > > > > > anyone
> > > > > > > > > > > > > 3. Spark is (slowly) moving away from INT96
> > > > > > > > > > > > > 4. our change is very narrow, backwards compatible
> > and
> > > > can
> > > > > > > improve
> > > > > > > > > > > > current
> > > > > > > > > > > > > workloads while (3) is ongoing
> > > > > > > > > > > > >
> > > > > > > > > > > > > Let's discuss more at the sync tonight.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > If we are going to standardize an ordering for
> > INT96,
> > > > > > rather
> > > > > > > than
> > > > > > > > > > > > parsing
> > > > > > > > > > > > > "created_by" fields, wouldn't it make more sense to
> > > add a
> > > > > new
> > > > > > > > > > > ColumnOrder
> > > > > > > > > > > > > value (like what's proposed for PARQUET-2249 [1])?
> > Then
> > > > we
> > > > > > > don't
> > > > > > > > > need
> > > > > > > > > > > to
> > > > > > > > > > > > > maintain a list of known good writers.
> > > > > > > > > > > > >
> > > > > > > > > > > > > We do not have to add another ColumnOrder value since
> > > > INT96
> > > > > > is
> > > > > > > a
> > > > > > > > > > > > *physical*
> > > > > > > > > > > > > type and can only take timestamps in the specified
> > > > format.
> > > > > > > This was
> > > > > > > > > > > > > arguably a design wart as it should have been a
> > > > > > > > > > > FIXED_LEN_BYTE_ARRAY(12)
> > > > > > > > > > > > > with logical type INT96_TIMESTAMP, for which a
> > > different
> > > > > > > > > ColumnOrder
> > > > > > > > > > > > would
> > > > > > > > > > > > > make sense. In this case we are lucky this is a
> > > physical
> > > > > type
> > > > > > > > > without
> > > > > > > > > > > > > logical type attached because otherwise, we couldn't
> > > have
> > > > > > made
> > > > > > > this
> > > > > > > > > > > > change
> > > > > > > > > > > > > in a backwards compatible way as easily.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, Jun 21, 2025 at 12:57 AM Ed Seidl <
> > > > > > [email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > If we are going to standardize an ordering for
> > INT96,
> > > > > > rather
> > > > > > > than
> > > > > > > > > > > > parsing
> > > > > > > > > > > > > > "created_by" fields, wouldn't it make more sense to
> > > > add a
> > > > > > new
> > > > > > > > > > > > ColumnOrder
> > > > > > > > > > > > > > value (like what's proposed for PARQUET-2249 [1])?
> > > Then
> > > > > we
> > > > > > > don't
> > > > > > > > > > need
> > > > > > > > > > > > to
> > > > > > > > > > > > > > maintain a list of known good writers.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ed
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > [1] https://github.com/apache/parquet-format/pull/221
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 2025/06/19 10:15:13 Andrew Lamb wrote:
> > > > > > > > > > > > > > > > While INT96 is now deprecated, it's still the
> > > > default
> > > > > > > > > timestamp
> > > > > > > > > > > > type
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > Spark, resulting in a significant amount of
> > > > existing
> > > > > > data
> > > > > > > > > > written
> > > > > > > > > > > > in
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > format.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I agree with Gang and Antoine that the better
> > > > solution
> > > > > is
> > > > > > > to
> > > > > > > > > > change
> > > > > > > > > > > > > Spark
> > > > > > > > > > > > > > > to write non deprecated parquet data types.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It seems there is an issue in the Spark JIRA to
> > do
> > > > > > this[1]
> > > > > > > but
> > > > > > > > > > the
> > > > > > > > > > > > only
> > > > > > > > > > > > > > > feedback on the associated PR [2] is that it is a
> > > > > > breaking
> > > > > > > > > > change.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If Spark is going to keep writing INT96
> > timestamps
> > > > > > > > > indefinitely,
> > > > > > > > > > I
> > > > > > > > > > > > > > suggest
> > > > > > > > > > > > > > > we un-deprecate the INT96 timestamps to reflect
> > the
> > > > > > > ecosystem
> > > > > > > > > > > reality
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > they will be here for a while rather than
> > > pretending
> > > > > they
> > > > > > > are
> > > > > > > > > > > really
> > > > > > > > > > > > > > > deprecated.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Andrew
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > [1]: https://issues.apache.org/jira/browse/SPARK-51359
> > > > > > > > > > > > > > > > [2]: https://github.com/apache/spark/pull/50215#issuecomment-2715147840
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > p.s. as an aside, is anyone from Databricks
> > pushing
> > > > > spark
> > > > > > > to
> > > > > > > > > > change
> > > > > > > > > > > > > > > timestamp type? Or will the focus be to  improve
> > > > INT96
> > > > > > > > > timestamps
> > > > > > > > > > > > > > instead?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Jun 18, 2025 at 10:50 PM Gang Wu <
> > > > > > [email protected]
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > It seems not adding too much value to improve a
> > > > > > > deprecated
> > > > > > > > > > > feature
> > > > > > > > > > > > > > > > especially
> > > > > > > > > > > > > > > > when there are abundant Parquet implementations
> > > in
> > > > > the
> > > > > > > wild.
> > > > > > > > > > > IIRC,
> > > > > > > > > > > > > > > > parquet-java
> > > > > > > > > > > > > > > > is planning to release 1.16.0 for new data
> > types
> > > > like
> > > > > > > variant
> > > > > > > > > > and
> > > > > > > > > > > > > > geometry.
> > > > > > > > > > > > > > > > It is
> > > > > > > > > > > > > > > > also the last version to support Java 8. All
> > > > > deprecated
> > > > > > > APIs
> > > > > > > > > > > might
> > > > > > > > > > > > > get
> > > > > > > > > > > > > > > > removed
> > > > > > > > > > > > > > > > from 2.0.0 so I'm not sure if older Spark
> > > versions
> > > > > are
> > > > > > > able
> > > > > > > > > to
> > > > > > > > > > > > > > leverage the
> > > > > > > > > > > > > > > > int96
> > > > > > > > > > > > > > > > stats. The right way to go is to push forward
> > the
> > > > > > > adoption of
> > > > > > > > > > > > > timestamp
> > > > > > > > > > > > > > > > logical
> > > > > > > > > > > > > > > > types.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > Gang
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Thu, Jun 19, 2025 at 12:31 AM Micah
> > Kornfield
> > > <
> > > > > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi Alkis,
> > > > > > > > > > > > > > > > > Is this the right thread link?  It seems to
> > be
> > > a
> > > > > > > discussion
> > > > > > > > > > on
> > > > > > > > > > > > > > Timestamp
> > > > > > > > > > > > > > > > > Nano support (which IIUC won't use int96, but
> > > I'm
> > > > > not
> > > > > > > sure
> > > > > > > > > > this
> > > > > > > > > > > > > > covers
> > > > > > > > > > > > > > > > > changing the behavior for existing
> > timestamps,
> > > > > which
> > > > > > I
> > > > > > > > > think
> > > > > > > > > > > are
> > > > > > > > > > > > at
> > > > > > > > > > > > > > > > either
> > > > > > > > > > > > > > > > > millisecond or microsecond granularity)?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > there will be customers that want to
> > interface
> > > > with
> > > > > > > legacy
> > > > > > > > > > > > systems
> > > > > > > > > > > > > > > > > > with INT96. This is why we decided in doing
> > > > both.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > It might help to elaborate on the time-frame
> > > > here.
> > > > > > > Since
> > > > > > > > > it
> > > > > > > > > > > > > appears
> > > > > > > > > > > > > > > > > reference implementations of parquet are not
> > > > > > currently
> > > > > > > > > > writing
> > > > > > > > > > > > > > > > statistics,
> > > > > > > > > > > > > > > > > if we merge these changes when they will be
> > > > picked
> > > > > up
> > > > > > > in
> > > > > > > > > > Spark?
> > > > > > > > > > > > > > Would the
> > > > > > > > > > > > > > > > > plan be to backport the parquet-java to older
> > > > > version
> > > > > > > of
> > > > > > > > > > Spark
> > > > > > > > > > > > > > (otherwise
> > > > > > > > > > > > > > > > > the legacy systems wouldn't really make use
> > or
> > > > emit
> > > > > > > stats
> > > > > > > > > > > > anyways)?
> > > > > > > > > > > > > > What
> > > > > > > > > > > > > > > > > is the delta between Spark picking up these
> > > > changes
> > > > > > and
> > > > > > > > > > > > > > transitioning off
> > > > > > > > > > > > > > > > > of Int96 by default?   Is the expectation
> > that
> > > > even
> > > > > > > once
> > > > > > > > > the
> > > > > > > > > > > > > default
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > changed in spark to not use int96, there will
> > > be
> > > > a
> > > > > > > large
> > > > > > > > > > number
> > > > > > > > > > > > of
> > > > > > > > > > > > > > users
> > > > > > > > > > > > > > > > > that will override the default to write
> > int96?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > Micah
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Jun 18, 2025 at 1:35 AM Alkis
> > > Evlogimenos
> > > > > > > > > > > > > > > > > <[email protected]>
> > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > We are also driving that in parallel:
> > > > > > > > > > > > > > > > > > > https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Even when Spark defaults to INT64 there
> > will
> > > be
> > > > > old
> > > > > > > > > > versions
> > > > > > > > > > > of
> > > > > > > > > > > > > > Spark
> > > > > > > > > > > > > > > > > > running, there will be customers that want
> > to
> > > > > > > interface
> > > > > > > > > > with
> > > > > > > > > > > > > legacy
> > > > > > > > > > > > > > > > > systems
> > > > > > > > > > > > > > > > > > with INT96. This is why we decided in doing
> > > > both.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Wed, Jun 18, 2025 at 9:53 AM Antoine
> > > Pitrou
> > > > <
> > > > > > > > > > > > > [email protected]
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Can we get Spark to stop emitting INT96?
> > > They
> > > > > are
> > > > > > > not
> > > > > > > > > > being
> > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > > > extremely good community player here.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Regards
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Antoine.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Fri, 13 Jun 2025 15:17:51 +0200
> > > > > > > > > > > > > > > > > > > Alkis Evlogimenos
> > > > > > > > > > > > > > > > > > > <[email protected]
> > >
> > > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > Hi folks,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > While INT96 is now deprecated, it's
> > still
> > > > the
> > > > > > > default
> > > > > > > > > > > > > timestamp
> > > > > > > > > > > > > > > > type
> > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > > > Spark, resulting in a significant
> > amount
> > > of
> > > > > > > existing
> > > > > > > > > > data
> > > > > > > > > > > > > > written
> > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > > > > format.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Historically, parquet-mr/java has not
> > > > emitted
> > > > > > or
> > > > > > > read
> > > > > > > > > > > > > > statistics
> > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > > INT96.
> > > > > > > > > > > > > > > > > > > > This was likely due to the fact that
> > > > standard
> > > > > > > byte
> > > > > > > > > > > > comparison
> > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > INT96
> > > > > > > > > > > > > > > > > > > > representation doesn't align with
> > logical
> > > > > > > > > comparisons,
> > > > > > > > > > > > > > potentially
> > > > > > > > > > > > > > > > > > > leading
> > > > > > > > > > > > > > > > > > > > to incorrect min/max values. This is
> > > > > > unfortunate
> > > > > > > > > > because
> > > > > > > > > > > > > > timestamp
> > > > > > > > > > > > > > > > > > > filters
> > > > > > > > > > > > > > > > > > > > are extremely common and lack of stats
> > > > limits
> > > > > > > > > > > optimization
> > > > > > > > > > > > > > > > > > opportunities.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Since its inception Photon <
> > > > > > > > > > > > > > > > > https://www.databricks.com/product/photon>
> > > > > > > > > > > > > > > > > > > emitted
> > > > > > > > > > > > > > > > > > > > and utilized INT96 statistics by
> > > employing
> > > > a
> > > > > > > logical
> > > > > > > > > > > > > > comparator,
> > > > > > > > > > > > > > > > > > ensuring
> > > > > > > > > > > > > > > > > > > > their correctness. We have now
> > > implemented
> > > > > > > > > > > > > > > > > > > > <
> > > > > > > https://github.com/apache/parquet-java/pull/3243>
> > > > > > > > > the
> > > > > > > > > > > > same
> > > > > > > > > > > > > > > > support
> > > > > > > > > > > > > > > > > > > within
> > > > > > > > > > > > > > > > > > > > parquet-java.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > We'd like to get the community's
> > thoughts
> > > > on
> > > > > > this
> > > > > > > > > > > addition.
> > > > > > > > > > > > > We
> > > > > > > > > > > > > > > > > > anticipate
> > > > > > > > > > > > > > > > > > > > that most users may not be directly
> > > > affected
> > > > > > due
> > > > > > > to
> > > > > > > > > the
> > > > > > > > > > > > > > declining
> > > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > > > INT96. However, we are interested in
> > > > > > identifying
> > > > > > > any
> > > > > > > > > > > > > potential
> > > > > > > > > > > > > > > > > > drawbacks
> > > > > > > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > > > > unforeseen issues with this approach.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Cheers
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

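A minimal Python sketch of the "timestamp ordering" discussed in this thread,
for illustration only. It assumes the commonly used INT96 timestamp layout: 8
bytes of little-endian nanoseconds-of-day followed by a 4-byte little-endian
Julian day. The function names are hypothetical and are not taken from
parquet-java or arrow-rs.

```python
import struct


def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp into the assumed 12-byte INT96 layout."""
    return struct.pack("<qi", nanos_of_day, julian_day)


def int96_sort_key(raw: bytes) -> tuple[int, int]:
    """Timestamp ordering: compare the Julian day first, then nanos of day."""
    if len(raw) != 12:
        raise ValueError("INT96 values must be exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


def int96_min_max(values: list[bytes]) -> tuple[bytes, bytes]:
    """Compute min/max statistics using timestamp ordering."""
    return min(values, key=int96_sort_key), max(values, key=int96_sort_key)


# Two example values: `later` falls on the following day but has a smaller
# nanos-of-day field than `earlier`.
earlier = encode_int96(2_460_000, 5)
later = encode_int96(2_460_001, 1)

logical_min, logical_max = int96_min_max([later, earlier])
bytewise_min = min([later, earlier])  # plain lexicographic byte comparison
```

In this example the plain byte comparison picks `later` as the minimum,
because the nanoseconds field comes first in the layout, while the timestamp
ordering picks `earlier`. That mismatch is why a reader needs some signal
(an allow list or a new ColumnOrder) before trusting INT96 min/max values.
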