Re: [DISCUSS] INT96 stats

Divjot Arora via dev Fri, 19 Jun 2026 05:28:10 -0700

Hi everyone,

Thank you for the input so far on the PRs. Given that the parquet-format PR
[1] has two approvals, I plan to give it another week before starting an
official vote. Please take a look at the spec and implementation PRs if
interested; all feedback is welcome.


-- Divjot Arora

[1] https://github.com/apache/parquet-format/pull/584
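
As background for the thread quoted below, the two ways a reader can decide whether INT96 min/max stats are trustworthy can be sketched roughly as follows. This is only an illustration under stated assumptions: the `created_by` layout ("<application> version <version> ...") follows the convention documented for that field, while the allow-list entry, the enum member `NEW_INT96_ORDER`, and all function names are hypothetical placeholders and are not taken from the parquet-format or parquet-java PRs.

```python
import re
from enum import Enum


class ColumnOrder(Enum):
    """Column orders a reader might see in file metadata (illustrative only)."""
    UNDEFINED = "undefined"
    TYPE_DEFINED_ORDER = "type_defined_order"
    # Placeholder: the thread does not name the new order.
    NEW_INT96_ORDER = "new_int96_order"


# "<application> version <dotted version> ..." (anything after it is ignored)
CREATED_BY = re.compile(r"^(?P<app>.+?) version (?P<version>\d+(?:\.\d+)*)")


def parse_created_by(created_by):
    """Return (application, version tuple), or None if the string does not parse."""
    match = CREATED_BY.match(created_by)
    if match is None:
        return None
    version = tuple(int(part) for part in match.group("version").split("."))
    return match.group("app"), version


def can_use_stats_allow_list(created_by, allow_list):
    """Option 1 style: trust stats only from writers on a reader-side allow list."""
    parsed = parse_created_by(created_by)
    if parsed is None:
        return False
    app, version = parsed
    minimum = allow_list.get(app)
    return minimum is not None and version >= minimum


def can_use_stats_column_order(column_order):
    """Option 3 style: trust stats only when the writer signalled the new order."""
    return column_order is ColumnOrder.NEW_INT96_ORDER


# Hypothetical allow list: writer name -> first version with valid INT96 stats.
ALLOW_LIST = {"example-writer": (1, 2, 0)}
```

With the allow list, every reader has to keep its own table of writers and versions up to date; with the column-order check, the file itself carries the signal, which is the property the thread settles on.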

On Mon, Jun 15, 2026 at 9:20 PM Divjot Arora <[email protected]>
wrote:

> Thanks for the input folks. I've gone ahead and opened up a parquet-format
> PR to add this new sort order [1] as well as a parquet-java reference
> implementation PR [2]. Ed has also graciously opened an arrow-rs PR with a
> reference implementation [3]. Thank you everyone who's already reviewed
> these PRs, the feedback has been super helpful. Please take a look and
> leave comments if you're interested.
>
> -- Divjot Arora
>
> [1] https://github.com/apache/parquet-format/pull/584
> [2] https://github.com/apache/parquet-java/pull/3610
> [3] https://github.com/apache/arrow-rs/pull/10106
>
> On Mon, Jun 8, 2026 at 8:37 PM Ed Seidl <[email protected]> wrote:
>
>> +1 for a new ColumnOrder. It's far preferable to parsing created_by
>> strings. I can provide a Rust PoC once the parquet-format PR is live.
>>
>> Ed
>>
>> On 2026/06/08 17:37:50 Divjot Arora via dev wrote:
>> > Hi folks,
>> >
>> > After more discussion on the right approach to signal validity of
>> > statistics for int96 columns, we've decided to implement option 3
>> > mentioned here
>> > <https://lists.apache.org/thread/6t9fr6v602zwt0tw22bqwg81f1ny9ncj>:
>> > "Formalize ordering as now defined using the timestamp ordering and
>> > define a new SortOrder required for writers/readers to use stats".
>> > This provides the strongest guarantee for readers to ensure that the
>> > stats are valid, which we feel is important given the risk of
>> > reading/using incorrect stats. Please let me know if anyone has
>> > concerns or objections to this approach. I will start drafting the
>> > parquet-format change and parquet-java implementation in parallel.
>> >
>> > -- Divjot
>> >
>> > On Thu, Jun 4, 2026 at 10:14 PM Ryan Blue <[email protected]> wrote:
>> >
>> > > I think that we need to add a sort order so that writers can signal
>> > > that they produced INT96 stats with timestamp ordering. We recently
>> > > added a new sort order for float and double to signal basically the
>> > > same thing and I don't see why we would not do the same thing here.
>> > >
>> > > On Wed, Jun 3, 2026 at 5:48 AM Rahul Sharma via dev <
>> > > [email protected]>
>> > > wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > Reviving this thread. I'd like to land Option 1 from Micah's summary
>> > > > (keep INT96 ordering undefined, allow-list on readers) in
>> > > > parquet-java and I have an open PR for this:
>> > > > https://github.com/apache/parquet-java/pull/3590.
>> > > > If there are any objections, let's discuss them in this thread or
>> > > > in the PR.
>> > > >
>> > > > Thanks,
>> > > > Rahul
>> > > >
>> > > >
>> > > > On Mon, Aug 4, 2025 at 7:03 AM Micah Kornfield <
>> [email protected]>
>> > > > wrote:
>> > > >
>> > > > > On Thu, Jul 24 at 1:19 AM, Gang Wu wrote:
>> > > > >
>> > > > > > For 1 and 2, do we need to maintain an allow-list for known
>> > > > > > writer implementations as well as their versions officially?
>> > > > > > My feeling is no. Perhaps it is the responsibility of
>> > > > > > interested implementations to maintain it internally because
>> > > > > > many projects may not even care about INT96 stats.
>> > > > >
>> > > > > I think it would be unofficial as it is not part of the spec.
>> > > > > Including it on the compatibility matrix might be helpful.
>> > > > >
>> > > > > > I prefer solutions that don't require an allow list to use
>> > > > > > INT96 stats. I don't agree that we could just let
>> > > > > > implementations handle the allow lists. Whatever Parquet Java
>> > > > > > implements will be copied by other people and we will
>> > > > > > effectively have an allow list that is not well documented.
>> > > > >
>> > > > > I think I'm OK with this.  Documenting compatibility can be done
>> > > > > via the compatibility matrix for those implementations that care
>> > > > > about this.
>> > > > >
>> > > > > > For 3, I think it is a bug of implementations who fail on new
>> > > > > > column order. If we want to move forward [1] by adding a new
>> > > > > > column order for IEEE754 total order, this bug should be fixed
>> > > > > > anyway.
>> > > > >
>> > > > > I agree this would need to be fixed on the rust side for IEEE754,
>> > > > > but that is a separate concern.  I personally don't think breaking
>> > > > > potential old readers for a deprecated type, that will hopefully
>> > > > > stop being written to a large extent in ~1 year's time, is worth
>> > > > > the engineering effort here. Especially, if Spark moves away from
>> > > > > Int96 as default, there would probably be very few new files
>> > > > > written with the sort order.  The real question then becomes
>> > > > > whether we want to allow efficient pruning for existing files
>> > > > > that are in a known good state.
>> > > > >
>> > > > > I'd really rather leave this up to implementation maintainers who
>> > > > > are open to accepting PRs to allow-list specific implementations
>> > > > > if they feel it is worthwhile.
>> > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Thu, Jul 24, 2025 at 12:04 PM Ryan Blue <[email protected]>
>> wrote:
>> > > > >
>> > > > > > I prefer solutions that don't require an allow list to use INT96
>> > > > > > stats. I don't agree that we could just let implementations
>> > > > > > handle the allow lists. Whatever Parquet Java implements will be
>> > > > > > copied by other people and we will effectively have an allow
>> > > > > > list that is not well documented. I think that we need to solve
>> > > > > > this so the requirements are understood (how to sort values) and
>> > > > > > so that implementations can signal that a file was written with
>> > > > > > stats that fit those requirements, without allow lists.
>> > > > > >
>> > > > > > On Thu, Jul 24, 2025 at 9:09 AM Alkis Evlogimenos
>> > > > > > <[email protected]> wrote:
>> > > > > >
>> > > > > > > My preference would be 1, 3, 2 in that order. Not super strong
>> > > > > > > opinion though, my take is that any of them works for the near
>> > > > > > > term until the type dies off.
>> > > > > > >
>> > > > > > > On Thu, Jul 24, 2025 at 6:46 PM Ed Seidl <[email protected]>
>> > > wrote:
>> > > > > > >
>> > > > > > > > If INT96 is to remain deprecated, I'd prefer 1. If we want a
>> > > > > > > > defined ordering for INT96 I'd prefer 3 to maintaining a
>> > > > > > > > "known good" list.
>> > > > > > > >
>> > > > > > > > As to the forward compatibility issue with rust, that's
>> > > > > > > > already an issue with logical types (and any other unions in
>> > > > > > > > the spec). We're currently trying to work that [1].
>> > > > > > > >
>> > > > > > > > Cheers,
>> > > > > > > > Ed
>> > > > > > > >
>> > > > > > > > [1] https://github.com/apache/arrow-rs/issues/7909
>> > > > > > > >
>> > > > > > > > On 2025/07/24 08:19:13 Gang Wu wrote:
>> > > > > > > > > For 1 and 2, do we need to maintain an allow-list for known
>> > > > > > > > > writer implementations as well as their versions
>> > > > > > > > > officially? My feeling is no. Perhaps it is the
>> > > > > > > > > responsibility of interested implementations to maintain
>> > > > > > > > > it internally because many projects may not even care
>> > > > > > > > > about INT96 stats.
>> > > > > > > > >
>> > > > > > > > > For 3, I think it is a bug of implementations who fail on
>> > > > > > > > > new column order. If we want to move forward [1] by adding
>> > > > > > > > > a new column order for IEEE754 total order, this bug
>> > > > > > > > > should be fixed anyway.
>> > > > > > > > >
>> > > > > > > > > [1] https://github.com/apache/parquet-format/pull/221
>> > > > > > > > >
>> > > > > > > > > On Thu, Jul 24, 2025 at 1:30 AM Micah Kornfield <
>> > > > > > [email protected]
>> > > > > > > >
>> > > > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > > > Just to follow up on this, I think the last issue
>> > > > > > > > > > remaining is updating the spec.
>> > > > > > > > > >
>> > > > > > > > > > There is already a draft PR
>> > > > > > > > > > (https://github.com/apache/parquet-format/pull/504) for
>> > > > > > > > > > updating the spec.
>> > > > > > > > > >
>> > > > > > > > > > I think there are three main options:
>> > > > > > > > > > 1.  Keep ordering for int96 undefined with an
>> > > > > > > > > >     implementation note (the current PR does this).
>> > > > > > > > > > 2.  Formalize ordering as now defined using the timestamp
>> > > > > > > > > >     ordering.
>> > > > > > > > > > 3.  Formalize ordering as now defined using the timestamp
>> > > > > > > > > >     ordering and define a new SortOrder required for
>> > > > > > > > > >     writers/readers to use stats.
>> > > > > > > > > >
>> > > > > > > > > > The main trade-off is that for options 1 and 2 we
>> > > > > > > > > > potentially need to allow-list implementations that are
>> > > > > > > > > > known to produce valid stats (e.g. older versions of Rust
>> > > > > > > > > > were writing stats that didn't conform to Timestamp
>> > > > > > > > > > ordering).
>> > > > > > > > > >
>> > > > > > > > > > For item #3, the main issue is that not all readers might
>> > > > > > > > > > be forward compatible for a new sort order.  In
>> > > > > > > > > > particular Rust readers would break on any new files [1].
>> > > > > > > > > >
>> > > > > > > > > > Given this I suggest we move forward with the currently
>> > > > > > > > > > opened PR and not officially formalize this in the spec.
>> > > > > > > > > > Implementations will need to allow-list for known good
>> > > > > > > > > > writers.
>> > > > > > > > > >
>> > > > > > > > > > Thanks,
>> > > > > > > > > > Micah
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > [1] https://github.com/apache/arrow-rs/issues/7909
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On Mon, Jun 30, 2025 at 8:55 AM Alkis Evlogimenos
>> > > > > > > > > > <[email protected]> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > I also checked internally with the Spark OSS team and
>> > > > > > > > > > > the plan for having INT64 timestamps in Spark by
>> > > > > > > > > > > default is to make the change when Delta v5 and Iceberg
>> > > > > > > > > > > v4 are proposed. This is expected to happen around the
>> > > > > > > > > > > first half of 2026.
>> > > > > > > > > > >
>> > > > > > > > > > > On Wed, Jun 25, 2025 at 8:41 PM Andrew Lamb <
>> > > > > > > [email protected]>
>> > > > > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > We had a good discussion about this at the sync today.
>> > > > > > > > > > > > Here is my summary
>> > > > > > > > > > > >
>> > > > > > > > > > > > * Pedantically, according to the current spec[1] there
>> > > > > > > > > > > > is no defined ordering for Int96 types and thus
>> > > > > > > > > > > > arrow-rs can not be writing "incorrect" values (as
>> > > > > > > > > > > > there is no definition of correct)
>> > > > > > > > > > > > * Practically speaking, arrow-rs is writing something
>> > > > > > > > > > > > different than Photon (Databricks proprietary spark
>> > > > > > > > > > > > engine)
>> > > > > > > > > > > > * What Photon is doing arguably makes more sense (to
>> > > > > > > > > > > > use the ordering of the only logical type to use
>> > > > > > > > > > > > Int96)
>> > > > > > > > > > > > * GH-7686: [Parquet] Fix int96 min/max stats #7687[2]
>> > > > > > > > > > > > brings arrow-rs into line with Photon which makes
>> > > > > > > > > > > > sense to me
>> > > > > > > > > > > >
>> > > > > > > > > > > > Rahul has also filed a ticket in parquet-format to
>> > > > > > > > > > > > discuss formalizing the ordering of Int96
>> > > > > > > > > > > > statistics[3]
>> > > > > > > > > > > >
>> > > > > > > > > > > > In the interim, I filed a PR[4] in the parquet-format
>> > > > > > > > > > > > repo to at least try and clarify the intent of the
>> > > > > > > > > > > > changes to arrow-rs and parquet-java
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thanks,
>> > > > > > > > > > > > Andrew
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > [1]: https://github.com/apache/parquet-format/blob/cf943c197f4fad826b14ba0c40eb0ffdab585285/src/main/thrift/parquet.thrift#L1079
>> > > > > > > > > > > > [2]: https://github.com/apache/arrow-rs/pull/7687
>> > > > > > > > > > > > [3]: https://github.com/apache/parquet-format/issues/502
>> > > > > > > > > > > > [4]: https://github.com/apache/parquet-format/pull/504
>> > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Wed, Jun 25, 2025 at 10:52 AM Rahul Sharma
>> > > > > > > > > > > > <[email protected]> wrote:
>> > > > > > > > > > > >
>> > > > > > > > > > > > > I have prepared a doc
>> > > > > > > > > > > > > <https://docs.google.com/document/d/1Ox0qHYBgs_3-pNqn9V8zVQm_W6qP0lsbd2XwQnQVz1Y/edit?tab=t.0>
>> > > > > > > > > > > > > to summarize and have all the relevant links in one
>> > > > > > > > > > > > > place.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Wed, Jun 25, 2025 at 1:32 PM Alkis Evlogimenos
>> > > > > > > > > > > > > <[email protected]> wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > Spark needs to start writing INT64 nanos first to be
>> > > > > > > > > > > > > > able to replace INT96 which is in nanos if data is
>> > > > > > > > > > > > > > at nano granularity. This is why I linked that
>> > > > > > > > > > > > > > ticket which is a prerequisite to switching to INT64
>> > > > > > > > > > > > > > in many cases.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > I understand the concerns around changing a
>> > > > > > > > > > > > > > deprecated aspect of the parquet spec. The reason we
>> > > > > > > > > > > > > > decided to bring this forward is because:
>> > > > > > > > > > > > > > 1. there are a lot of parquet files with the right
>> > > > > > > > > > > > > > INT96 stats out there (Photon has been writing them
>> > > > > > > > > > > > > > for years)
>> > > > > > > > > > > > > > 2. all engines ignore the INT96 stats so Photon
>> > > > > > > > > > > > > > writing them didn't break anyone
>> > > > > > > > > > > > > > 3. Spark is (slowly) moving away from INT96
>> > > > > > > > > > > > > > 4. our change is very narrow, backwards compatible
>> > > > > > > > > > > > > > and can improve current workloads while (3) is
>> > > > > > > > > > > > > > ongoing
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Let's discuss more at the sync tonight.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > If we are going to standardize an ordering for
>> > > > > > > > > > > > > > > INT96, rather than parsing "created_by" fields,
>> > > > > > > > > > > > > > > wouldn't it make more sense to add a new
>> > > > > > > > > > > > > > > ColumnOrder value (like what's proposed for
>> > > > > > > > > > > > > > > PARQUET-2249 [1])? Then we don't need to maintain
>> > > > > > > > > > > > > > > a list of known good writers.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > We do not have to add another ColumnOrder value
>> > > > > > > > > > > > > > since INT96 is a *physical* type and can only take
>> > > > > > > > > > > > > > timestamps in the specified format. This was
>> > > > > > > > > > > > > > arguably a design wart as it should have been a
>> > > > > > > > > > > > > > FIXED_LEN_BYTE_ARRAY(12) with logical type
>> > > > > > > > > > > > > > INT96_TIMESTAMP, for which a different ColumnOrder
>> > > > > > > > > > > > > > would make sense. In this case we are lucky this is
>> > > > > > > > > > > > > > a physical type without logical type attached
>> > > > > > > > > > > > > > because otherwise, we couldn't have made this change
>> > > > > > > > > > > > > > in a backwards compatible way as easily.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > On Sat, Jun 21, 2025 at 12:57 AM Ed Seidl <
>> > > > > > > [email protected]>
>> > > > > > > > > > > wrote:
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > If we are going to standardize an ordering for
>> > > > > > > > > > > > > > > INT96, rather than parsing "created_by" fields,
>> > > > > > > > > > > > > > > wouldn't it make more sense to add a new
>> > > > > > > > > > > > > > > ColumnOrder value (like what's proposed for
>> > > > > > > > > > > > > > > PARQUET-2249 [1])? Then we don't need to maintain
>> > > > > > > > > > > > > > > a list of known good writers.
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > Ed
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > [1]
>> > > > https://github.com/apache/parquet-format/pull/221
>> > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > On 2025/06/19 10:15:13 Andrew Lamb wrote:
>> > > > > > > > > > > > > > > > > While INT96 is now deprecated, it's still the
>> > > > > > > > > > > > > > > > > default timestamp type in Spark, resulting in
>> > > > > > > > > > > > > > > > > a significant amount of existing data written
>> > > > > > > > > > > > > > > > > in this format.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > I agree with Gang and Antoine that the better
>> > > > > > > > > > > > > > > > solution is to change Spark to write non
>> > > > > > > > > > > > > > > > deprecated parquet data types.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > It seems there is an issue in the Spark JIRA to
>> > > > > > > > > > > > > > > > do this[1] but the only feedback on the
>> > > > > > > > > > > > > > > > associated PR [2] is that it is a breaking
>> > > > > > > > > > > > > > > > change.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > If Spark is going to keep writing INT96
>> > > > > > > > > > > > > > > > timestamps indefinitely, I suggest we
>> > > > > > > > > > > > > > > > un-deprecate the INT96 timestamps to reflect the
>> > > > > > > > > > > > > > > > ecosystem reality that they will be here for a
>> > > > > > > > > > > > > > > > while rather than pretending they are really
>> > > > > > > > > > > > > > > > deprecated.
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > Andrew
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > [1]: https://issues.apache.org/jira/browse/SPARK-51359
>> > > > > > > > > > > > > > > > [2]: https://github.com/apache/spark/pull/50215#issuecomment-2715147840
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > p.s. as an aside, is anyone from Databricks
>> > > > > > > > > > > > > > > > pushing spark to change timestamp type? Or will
>> > > > > > > > > > > > > > > > the focus be to improve INT96 timestamps
>> > > > > > > > > > > > > > > > instead?
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > On Wed, Jun 18, 2025 at 10:50 PM Gang Wu <
>> > > > > > > [email protected]
>> > > > > > > > >
>> > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > It doesn't seem to add much value to improve a
>> > > > > > > > > > > > > > > > > deprecated feature, especially when there are
>> > > > > > > > > > > > > > > > > abundant Parquet implementations in the wild.
>> > > > > > > > > > > > > > > > > IIRC, parquet-java is planning to release
>> > > > > > > > > > > > > > > > > 1.16.0 for new data types like variant and
>> > > > > > > > > > > > > > > > > geometry. It is also the last version to
>> > > > > > > > > > > > > > > > > support Java 8. All deprecated APIs might get
>> > > > > > > > > > > > > > > > > removed from 2.0.0 so I'm not sure if older
>> > > > > > > > > > > > > > > > > Spark versions are able to leverage the int96
>> > > > > > > > > > > > > > > > > stats. The right way to go is to push forward
>> > > > > > > > > > > > > > > > > the adoption of timestamp logical types.
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > Best,
>> > > > > > > > > > > > > > > > > Gang
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > On Thu, Jun 19, 2025 at 12:31 AM Micah
>> > > Kornfield
>> > > > <
>> > > > > > > > > > > > > > > [email protected]>
>> > > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > Hi Alkis,
>> > > > > > > > > > > > > > > > > > Is this the right thread link?  It seems to
>> > > > > > > > > > > > > > > > > > be a discussion on Timestamp Nano support
>> > > > > > > > > > > > > > > > > > (which IIUC won't use int96, but I'm not
>> > > > > > > > > > > > > > > > > > sure this covers changing the behavior for
>> > > > > > > > > > > > > > > > > > existing timestamps, which I think are at
>> > > > > > > > > > > > > > > > > > either millisecond or microsecond
>> > > > > > > > > > > > > > > > > > granularity)?
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > there will be customers that want to
>> > > > > > > > > > > > > > > > > > > interface with legacy systems with INT96.
>> > > > > > > > > > > > > > > > > > > This is why we decided on doing both.
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > It might help to elaborate on the time-frame
>> > > > > > > > > > > > > > > > > > here.  Since it appears reference
>> > > > > > > > > > > > > > > > > > implementations of parquet are not currently
>> > > > > > > > > > > > > > > > > > writing statistics, if we merge these
>> > > > > > > > > > > > > > > > > > changes when will they be picked up in
>> > > > > > > > > > > > > > > > > > Spark? Would the plan be to backport the
>> > > > > > > > > > > > > > > > > > parquet-java change to older versions of
>> > > > > > > > > > > > > > > > > > Spark (otherwise the legacy systems wouldn't
>> > > > > > > > > > > > > > > > > > really make use of or emit stats anyways)?
>> > > > > > > > > > > > > > > > > > What is the delta between Spark picking up
>> > > > > > > > > > > > > > > > > > these changes and transitioning off of Int96
>> > > > > > > > > > > > > > > > > > by default?   Is the expectation that even
>> > > > > > > > > > > > > > > > > > once the default is changed in spark to not
>> > > > > > > > > > > > > > > > > > use int96, there will be a large number of
>> > > > > > > > > > > > > > > > > > users that will override the default to
>> > > > > > > > > > > > > > > > > > write int96?
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > Thanks,
>> > > > > > > > > > > > > > > > > > Micah
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > On Wed, Jun 18, 2025 at 1:35 AM Alkis
>> > > > Evlogimenos
>> > > > > > > > > > > > > > > > > >
>> <[email protected]>
>> > > > > wrote:
>> > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > We are also driving that in parallel:
>> > > > > > > > > > > > > > > > > > > <https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of>.
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > Even when Spark defaults to INT64 there
>> > > > > > > > > > > > > > > > > > > will be old versions of Spark running,
>> > > > > > > > > > > > > > > > > > > and there will be customers that want to
>> > > > > > > > > > > > > > > > > > > interface with legacy systems with INT96.
>> > > > > > > > > > > > > > > > > > > This is why we decided on doing both.
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > On Wed, Jun 18, 2025 at 9:53 AM
>> Antoine
>> > > > Pitrou
>> > > > > <
>> > > > > > > > > > > > > > [email protected]
>> > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > Can we get Spark to stop emitting INT96?
>> > > > > > > > > > > > > > > > > > > > They are not being an extremely good
>> > > > > > > > > > > > > > > > > > > > community player here.
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > Regards
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > Antoine.
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > On Fri, 13 Jun 2025 15:17:51 +0200
>> > > > > > > > > > > > > > > > > > > > Alkis Evlogimenos
>> > > > > > > > > > > > > > > > > > > >
>> <[email protected]
>> > > >
>> > > > > > > > > > > > > > > > > > > > wrote:
>> > > > > > > > > > > > > > > > > > > > > Hi folks,
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > While INT96 is now deprecated, it's
>> > > > > > > > > > > > > > > > > > > > > still the default timestamp type in
>> > > > > > > > > > > > > > > > > > > > > Spark, resulting in a significant
>> > > > > > > > > > > > > > > > > > > > > amount of existing data written in
>> > > > > > > > > > > > > > > > > > > > > this format.
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > Historically, parquet-mr/java has not
>> > > > > > > > > > > > > > > > > > > > > emitted or read statistics for INT96.
>> > > > > > > > > > > > > > > > > > > > > This was likely due to the fact that
>> > > > > > > > > > > > > > > > > > > > > standard byte comparison on the INT96
>> > > > > > > > > > > > > > > > > > > > > representation doesn't align with
>> > > > > > > > > > > > > > > > > > > > > logical comparisons, potentially
>> > > > > > > > > > > > > > > > > > > > > leading to incorrect min/max values.
>> > > > > > > > > > > > > > > > > > > > > This is unfortunate because timestamp
>> > > > > > > > > > > > > > > > > > > > > filters are extremely common and lack
>> > > > > > > > > > > > > > > > > > > > > of stats limits optimization
>> > > > > > > > > > > > > > > > > > > > > opportunities.
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > Since its inception Photon
>> > > > > > > > > > > > > > > > > > > > > <https://www.databricks.com/product/photon>
>> > > > > > > > > > > > > > > > > > > > > emitted and utilized INT96 statistics
>> > > > > > > > > > > > > > > > > > > > > by employing a logical comparator,
>> > > > > > > > > > > > > > > > > > > > > ensuring their correctness. We have
>> > > > > > > > > > > > > > > > > > > > > now implemented
>> > > > > > > > > > > > > > > > > > > > > <https://github.com/apache/parquet-java/pull/3243>
>> > > > > > > > > > > > > > > > > > > > > the same support within parquet-java.
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > We'd like to get the community's
>> > > > > > > > > > > > > > > > > > > > > thoughts on this addition. We
>> > > > > > > > > > > > > > > > > > > > > anticipate that most users may not be
>> > > > > > > > > > > > > > > > > > > > > directly affected due to the declining
>> > > > > > > > > > > > > > > > > > > > > use of INT96. However, we are
>> > > > > > > > > > > > > > > > > > > > > interested in identifying any
>> > > > > > > > > > > > > > > > > > > > > potential drawbacks or unforeseen
>> > > > > > > > > > > > > > > > > > > > > issues with this approach.
>> > > > > > > > > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > > > > > > > Cheers
>> > > > > > > > > > > > > > > > > > > > >
>
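
The mismatch between byte-wise and logical comparison described in the original message at the bottom of the thread can be illustrated with a small sketch. It assumes the commonly used INT96 timestamp layout (8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, both little-endian); the function names are hypothetical and are not taken from parquet-java, arrow-rs, or Photon.

```python
import struct


def encode_int96(julian_day, nanos_of_day):
    """Pack a timestamp into 12 bytes: int64 nanos-of-day, then int32 Julian day."""
    return struct.pack("<qi", nanos_of_day, julian_day)


def timestamp_key(raw):
    """Sort key for chronological order: Julian day first, then nanos-of-day."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


def min_max(values, key=None):
    """Return (min, max) under the given key; key=None compares the raw bytes."""
    return min(values, key=key), max(values, key=key)


# Two timestamps one day apart; the later one has the smaller nanos-of-day.
earlier = encode_int96(julian_day=2460000, nanos_of_day=5)
later = encode_int96(julian_day=2460001, nanos_of_day=1)

bytewise = min_max([earlier, later])                     # raw byte comparison
logical = min_max([earlier, later], key=timestamp_key)   # timestamp ordering
```

Because the nanos-of-day bytes come first, the raw byte comparison ranks `later` below `earlier`, so the byte-wise min and max are swapped relative to the chronological ones. A reader pruning row groups with such stats could skip data it should have read, which is why the thread insists on a defined ordering and an explicit signal that the writer followed it.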
