Hi everyone,

Thank you for the input so far on the PRs. Given that the parquet-format PR [1] has two approvals, I plan to give it another week before starting an official vote. Please take a look at the spec and implementation PRs if you're interested; all feedback is welcome.
-- Divjot Arora

[1] https://github.com/apache/parquet-format/pull/584

On Mon, Jun 15, 2026 at 9:20 PM Divjot Arora <[email protected]> wrote:

> Thanks for the input, folks. I've gone ahead and opened up a parquet-format
> PR to add this new sort order [1] as well as a parquet-java reference
> implementation PR [2]. Ed has also graciously opened an arrow-rs PR with a
> reference implementation [3]. Thank you to everyone who's already reviewed
> these PRs; the feedback has been super helpful. Please take a look and
> leave comments if you're interested.
>
> -- Divjot Arora
>
> [1] https://github.com/apache/parquet-format/pull/584
> [2] https://github.com/apache/parquet-java/pull/3610
> [3] https://github.com/apache/arrow-rs/pull/10106
>
> On Mon, Jun 8, 2026 at 8:37 PM Ed Seidl <[email protected]> wrote:
>
>> +1 for a new ColumnOrder. It's far preferable to parsing created_by
>> strings. I can provide a Rust PoC once the parquet-format PR is live.
>>
>> Ed
>>
>> On 2026/06/08 17:37:50 Divjot Arora via dev wrote:
>> > Hi folks,
>> >
>> > After more discussion on the right approach to signal validity of
>> > statistics for int96 columns, we've decided to implement option 3
>> > mentioned here
>> > <https://lists.apache.org/thread/6t9fr6v602zwt0tw22bqwg81f1ny9ncj>:
>> > "Formalize ordering as now defined using the timestamp ordering and
>> > define a new SortOrder required for writers/readers to use stats". This
>> > provides the strongest guarantee for readers to ensure that the stats
>> > are valid, which we feel is important given the risk of reading/using
>> > incorrect stats. Please let me know if anyone has concerns or objections
>> > to this approach. I will start drafting the parquet-format change and
>> > parquet-java implementation in parallel.
>> > >> > -- Divjot >> > >> > On Thu, Jun 4, 2026 at 10:14 PM Ryan Blue <[email protected]> wrote: >> > >> > > I think that we need to add a sort order so that writers can signal >> that >> > > they produced INT96 stats with timestamp ordering. We recently added >> a new >> > > sort order for float and double to signal basically the same thing >> and I >> > > don't see why we would not do the same thing here. >> > > >> > > On Wed, Jun 3, 2026 at 5:48 AM Rahul Sharma via dev < >> > > [email protected]> >> > > wrote: >> > > >> > > > Hi all, >> > > > >> > > > Reviving this thread. I'd like to land Option 1 from Micah's summary >> > > (keep >> > > > INT96 ordering undefined, allow-list on readers) in parquet-java >> and I >> > > have >> > > > an open PR for this: >> https://github.com/apache/parquet-java/pull/3590. >> > > > If there are any objections, let's discuss them in this thread or >> in the >> > > > PR. >> > > > >> > > > Thanks, >> > > > Rahul >> > > > >> > > > >> > > > On Mon, Aug 4, 2025 at 7:03 AM Micah Kornfield < >> [email protected]> >> > > > wrote: >> > > > >> > > > > Gang Wu via <https://support.google.com/mail/answer/1311182?hl=en >> > >> > > > > parquet.apache.org >> > > > > Thu, Jul 24, 1:19 AM (10 days ago) >> > > > > to *dev* >> > > > > >> > > > > > For 1 and 2, do we need to maintain an allow-list for known >> writer >> > > > > > implementations >> > > > > > as well as their versions officially? My feeling is no. Perhaps >> it is >> > > > the >> > > > > > responsibility >> > > > > > of interesting implementations to maintain it internally >> because many >> > > > > > projects may >> > > > > > not even care about INT96 stats. >> > > > > >> > > > > >> > > > > I think it would be unofficial as it is not part of the spec. >> > > Including >> > > > it >> > > > > on the compatibility matrix might be helpful. >> > > > > >> > > > > >> > > > > I prefer solutions that don't require an allow list to use INT96 >> > > stats. 
I >> > > > > > don't agree that we could just let implementations handle the >> allow >> > > > > lists. >> > > > > > Whatever Parquet Java implements will be copied by other people >> and >> > > we >> > > > > will >> > > > > > effectively have an allow list that is not well documented. >> > > > > >> > > > > >> > > > > I think I'm OK with this. Documenting compatibility can be done >> via >> > > the >> > > > > compatibility matrix for those implementations that care about >> this. >> > > > > >> > > > > >> > > > > > >> > > > > >> > > > > For 3, I think it is a bug of implementations who fail on new >> column >> > > > > order. >> > > > > > If we want >> > > > > > to move forward [1] by adding a new column order for IEEE754 >> total >> > > > order, >> > > > > > this bug >> > > > > > should be fixed anyway. >> > > > > >> > > > > >> > > > > I agree this would need to be fixed on the rust side for IEEE754, >> but >> > > > that >> > > > > is a separate concern. I personally don't think breaking >> potential old >> > > > > readers for a deprecated type, that will hopefully stop being >> written >> > > to >> > > > a >> > > > > large extent in ~1 year time, is worth the engineering effort >> here. >> > > > > Especially, if Spark moves away from Int96 as default, there would >> > > > probably >> > > > > be very few new files written with the sort order. The real >> question >> > > > then >> > > > > becomes whether we want to allow efficient pruning for existing >> files >> > > > that >> > > > > are in a known good state. >> > > > > >> > > > > I'd really rather leave this up to implementation maintainers who >> are >> > > > open >> > > > > to accepting PR's to allow listing specific implementations if >> they >> > > feel >> > > > it >> > > > > is worthwhile. 
>> > > > > >> > > > > >> > > > > >> > > > > >> > > > > On Thu, Jul 24, 2025 at 12:04 PM Ryan Blue <[email protected]> >> wrote: >> > > > > >> > > > > > I prefer solutions that don't require an allow list to use INT96 >> > > > stats. I >> > > > > > don't agree that we could just let implementations handle the >> allow >> > > > > lists. >> > > > > > Whatever Parquet Java implements will be copied by other people >> and >> > > we >> > > > > will >> > > > > > effectively have an allow list that is not well documented. I >> think >> > > > that >> > > > > we >> > > > > > need to solve this so the requirements are understood (how to >> sort >> > > > > values) >> > > > > > and so that implementations can signal that a file was written >> with >> > > > stats >> > > > > > that fit those requirements, without allow lists. >> > > > > > >> > > > > > On Thu, Jul 24, 2025 at 9:09 AM Alkis Evlogimenos >> > > > > > <[email protected]> wrote: >> > > > > > >> > > > > > > My preference would be 1, 3, 2 in that order. Not super strong >> > > > opinion >> > > > > > > though, my take is that any of them works for the near term >> until >> > > the >> > > > > > type >> > > > > > > dies off. >> > > > > > > >> > > > > > > On Thu, Jul 24, 2025 at 6:46 PM Ed Seidl <[email protected]> >> > > wrote: >> > > > > > > >> > > > > > > > If INT96 is to remain deprecated, I'd prefer 1. If we want a >> > > > defined >> > > > > > > > ordering for INT96 I'd prefer 3 to maintaining a "known >> good" >> > > list. >> > > > > > > > >> > > > > > > > As to the forward compatibility issue with rust, that's >> already >> > > an >> > > > > > issue >> > > > > > > > with logical types (and any other unions in the spec). We're >> > > > > currently >> > > > > > > > trying to work that [1]. 
>> > > > > > > > >> > > > > > > > Cheers, >> > > > > > > > Ed >> > > > > > > > >> > > > > > > > [1] https://github.com/apache/arrow-rs/issues/7909 >> > > > > > > > >> > > > > > > > On 2025/07/24 08:19:13 Gang Wu wrote: >> > > > > > > > > For 1 and 2, do we need to maintain an allow-list for >> known >> > > > writer >> > > > > > > > > implementations >> > > > > > > > > as well as their versions officially? My feeling is no. >> Perhaps >> > > > it >> > > > > is >> > > > > > > the >> > > > > > > > > responsibility >> > > > > > > > > of interesting implementations to maintain it internally >> > > because >> > > > > many >> > > > > > > > > projects may >> > > > > > > > > not even care about INT96 stats. >> > > > > > > > > >> > > > > > > > > For 3, I think it is a bug of implementations who fail on >> new >> > > > > column >> > > > > > > > order. >> > > > > > > > > If we want >> > > > > > > > > to move forward [1] by adding a new column order for >> IEEE754 >> > > > total >> > > > > > > order, >> > > > > > > > > this bug >> > > > > > > > > should be fixed anyway. >> > > > > > > > > >> > > > > > > > > [1] https://github.com/apache/parquet-format/pull/221 >> > > > > > > > > >> > > > > > > > > On Thu, Jul 24, 2025 at 1:30 AM Micah Kornfield < >> > > > > > [email protected] >> > > > > > > > >> > > > > > > > > wrote: >> > > > > > > > > >> > > > > > > > > > Just to follow up on this, I think the last issues >> remaining >> > > > are >> > > > > > > > updating >> > > > > > > > > > the spec. >> > > > > > > > > > >> > > > > > > > > > There is already a draft PR ( >> > > > > > > > > > https://github.com/apache/parquet-format/pull/504) for >> > > > updating >> > > > > > the >> > > > > > > > spec. >> > > > > > > > > > >> > > > > > > > > > I think there are three main options: >> > > > > > > > > > 1. Keep ordering for int96 undefined with an >> implementation >> > > > note >> > > > > > > (the >> > > > > > > > > > current PR does this). >> > > > > > > > > > 2. 
Formalize ordering as now defined using the timestamp ordering.
>> > > > > > > > > > 3. Formalize ordering as now defined using the timestamp ordering and
>> > > > > > > > > >    define a new SortOrder required for writers/readers to use stats.
>> > > > > > > > > >
>> > > > > > > > > > The main trade-off for options 1 and 2 is that we potentially need to
>> > > > > > > > > > allow-list implementations that are known to produce valid stats (e.g.
>> > > > > > > > > > older versions of Rust were writing stats that didn't conform to
>> > > > > > > > > > timestamp ordering).
>> > > > > > > > > >
>> > > > > > > > > > For option 3, the main issue is that not all readers might be forward
>> > > > > > > > > > compatible with a new sort order. In particular, Rust readers would
>> > > > > > > > > > break on any new files [1].
>> > > > > > > > > >
>> > > > > > > > > > Given this, I suggest we move forward with the currently open PR and
>> > > > > > > > > > not officially formalize this in the spec. Implementations will need to
>> > > > > > > > > > allow-list known good writers.
>> > > > > > > > > > >> > > > > > > > > > Thanks, >> > > > > > > > > > Micah >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > [1] https://github.com/apache/arrow-rs/issues/7909 >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > On Mon, Jun 30, 2025 at 8:55 AM Alkis Evlogimenos >> > > > > > > > > > <[email protected]> wrote: >> > > > > > > > > > >> > > > > > > > > > > I also checked internally with the Spark OSS team and >> the >> > > > plan >> > > > > > for >> > > > > > > > having >> > > > > > > > > > > INT64 timestamps in Spark by default is to make the >> change >> > > > when >> > > > > > > > Delta v5 >> > > > > > > > > > > and Iceberg v4 are proposed. This is expected to >> happen >> > > > around >> > > > > > the >> > > > > > > > first >> > > > > > > > > > > half of 2026. >> > > > > > > > > > > >> > > > > > > > > > > On Wed, Jun 25, 2025 at 8:41 PM Andrew Lamb < >> > > > > > > [email protected]> >> > > > > > > > > > > wrote: >> > > > > > > > > > > >> > > > > > > > > > > > We had a good discussion about this at the sync >> today. 
>> > > > Here >> > > > > is >> > > > > > > my >> > > > > > > > > > > summary >> > > > > > > > > > > > >> > > > > > > > > > > > * Pedantically, according to the current spec[1] >> there is >> > > > no >> > > > > > > > defined >> > > > > > > > > > > > ordering for Int96 types and thus arrow-rs can not >> be >> > > > writing >> > > > > > > > > > "incorrect" >> > > > > > > > > > > > values (as there is no definition of correct) >> > > > > > > > > > > > * Practically speaking, arrow-rs is writing >> something >> > > > > different >> > > > > > > > than >> > > > > > > > > > > Photon >> > > > > > > > > > > > (Databricks proprietary spark engine) >> > > > > > > > > > > > * What Photon is doing arguably makes more sense >> (to use >> > > > the >> > > > > > > > ordering >> > > > > > > > > > of >> > > > > > > > > > > > the only logical type to use Int96) >> > > > > > > > > > > > * GH-7686: [Parquet] Fix int96 min/max stats >> #7687[2] >> > > > brings >> > > > > > > > arrow-rs >> > > > > > > > > > > into >> > > > > > > > > > > > line with Photon which makes sense to me >> > > > > > > > > > > > >> > > > > > > > > > > > Rahul has also filed a ticket in parquet-format to >> > > discuss >> > > > > > > > formalizing >> > > > > > > > > > > the >> > > > > > > > > > > > ordering of Int96 statistics[3] >> > > > > > > > > > > > >> > > > > > > > > > > > In the interim, I filed a PR[4] in the >> parquet-format >> > > repo >> > > > to >> > > > > > at >> > > > > > > > least >> > > > > > > > > > > try >> > > > > > > > > > > > and clarify the intent of the changes to arrow-rs >> and >> > > > > > > parquet-java >> > > > > > > > > > > > >> > > > > > > > > > > > Thanks, >> > > > > > > > > > > > Andrew >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > [1]: >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> 
https://github.com/apache/parquet-format/blob/cf943c197f4fad826b14ba0c40eb0ffdab585285/src/main/thrift/parquet.thrift#L1079 >> > > > > > > > > > > > [2]: https://github.com/apache/arrow-rs/pull/7687 >> > > > > > > > > > > > [3]: >> https://github.com/apache/parquet-format/issues/502 >> > > > > > > > > > > > [4]: >> https://github.com/apache/parquet-format/pull/504 >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > On Wed, Jun 25, 2025 at 10:52 AM Rahul Sharma >> > > > > > > > > > > > <[email protected]> wrote: >> > > > > > > > > > > > >> > > > > > > > > > > > > I have prepared a doc >> > > > > > > > > > > > > < >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> https://docs.google.com/document/d/1Ox0qHYBgs_3-pNqn9V8zVQm_W6qP0lsbd2XwQnQVz1Y/edit?tab=t.0 >> > > > > > > > > > > > > > >> > > > > > > > > > > > > to summarize and have all the relevant links in >> one >> > > > place. >> > > > > > > > > > > > > >> > > > > > > > > > > > > On Wed, Jun 25, 2025 at 1:32 PM Alkis Evlogimenos >> > > > > > > > > > > > > <[email protected]> wrote: >> > > > > > > > > > > > > >> > > > > > > > > > > > > > Spark needs to start writing INT64 nanos first >> to be >> > > > able >> > > > > > to >> > > > > > > > > > replace >> > > > > > > > > > > > > INT96 >> > > > > > > > > > > > > > which is in nanos if data is at nano >> granularity. >> > > This >> > > > is >> > > > > > > why I >> > > > > > > > > > > linked >> > > > > > > > > > > > > that >> > > > > > > > > > > > > > ticket which is a prerequisite to switching to >> INT64 >> > > in >> > > > > > many >> > > > > > > > cases. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > I understand the concerns around changing a >> > > deprecated >> > > > > > aspect >> > > > > > > > of >> > > > > > > > > > the >> > > > > > > > > > > > > > parquet spec. 
The reason we decided to bring this forward is:
>> > > > > > > > > > > > > > 1. there are a lot of parquet files with the right INT96 stats out
>> > > > > > > > > > > > > >    there (Photon has been writing them for years)
>> > > > > > > > > > > > > > 2. all engines ignore the INT96 stats, so Photon writing them didn't
>> > > > > > > > > > > > > >    break anyone
>> > > > > > > > > > > > > > 3. Spark is (slowly) moving away from INT96
>> > > > > > > > > > > > > > 4. our change is very narrow, backwards compatible, and can improve
>> > > > > > > > > > > > > >    current workloads while (3) is ongoing
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Let's discuss more at the sync tonight.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > > If we are going to standardize an ordering for INT96, rather than
>> > > > > > > > > > > > > > > parsing "created_by" fields, wouldn't it make more sense to add a new
>> > > > > > > > > > > > > > > ColumnOrder value (like what's proposed for PARQUET-2249 [1])? Then we
>> > > > > > > > > > > > > > > don't need to maintain a list of known good writers.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > We do not have to add another ColumnOrder value since INT96 is a
>> > > > > > > > > > > > > > *physical* type and can only take timestamps in the specified format.
>> > > > > > > > This was >> > > > > > > > > > > > > > arguably a design wart as it should have been a >> > > > > > > > > > > > FIXED_LEN_BYTE_ARRAY(12) >> > > > > > > > > > > > > > with logical type INT96_TIMESTAMP, for which a >> > > > different >> > > > > > > > > > ColumnOrder >> > > > > > > > > > > > > would >> > > > > > > > > > > > > > make sense. In this case we are lucky this is a >> > > > physical >> > > > > > type >> > > > > > > > > > without >> > > > > > > > > > > > > > logical type attached because otherwise, we >> couldn't >> > > > have >> > > > > > > made >> > > > > > > > this >> > > > > > > > > > > > > change >> > > > > > > > > > > > > > in a backwards compatible way as easily. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > On Sat, Jun 21, 2025 at 12:57 AM Ed Seidl < >> > > > > > > [email protected]> >> > > > > > > > > > > wrote: >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > > If we are going to standardize an ordering for >> > > INT96, >> > > > > > > rather >> > > > > > > > than >> > > > > > > > > > > > > parsing >> > > > > > > > > > > > > > > "created_by" fields, wouldn't it make more >> sense to >> > > > > add a >> > > > > > > new >> > > > > > > > > > > > > ColumnOrder >> > > > > > > > > > > > > > > value (like what's proposed for PARQUET-2249 >> [1])? >> > > > Then >> > > > > > we >> > > > > > > > don't >> > > > > > > > > > > need >> > > > > > > > > > > > > to >> > > > > > > > > > > > > > > maintain a list of known good writers. 
>> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Ed >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > [1] >> > > > https://github.com/apache/parquet-format/pull/221 >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > On 2025/06/19 10:15:13 Andrew Lamb wrote: >> > > > > > > > > > > > > > > > > While INT96 is now deprecated, it's still >> the >> > > > > default >> > > > > > > > > > timestamp >> > > > > > > > > > > > > type >> > > > > > > > > > > > > > in >> > > > > > > > > > > > > > > > > Spark, resulting in a significant amount >> of >> > > > > existing >> > > > > > > data >> > > > > > > > > > > written >> > > > > > > > > > > > > in >> > > > > > > > > > > > > > > this >> > > > > > > > > > > > > > > > > format. >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > I agree with Gang and Antoine that the >> better >> > > > > solution >> > > > > > is >> > > > > > > > to >> > > > > > > > > > > change >> > > > > > > > > > > > > > Spark >> > > > > > > > > > > > > > > > to write non deprecated parquet data types. >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > It seems there is an issue in the Spark >> JIRA to >> > > do >> > > > > > > this[1] >> > > > > > > > but >> > > > > > > > > > > the >> > > > > > > > > > > > > only >> > > > > > > > > > > > > > > > feedback on the associated PR [2] is that >> it is a >> > > > > > > breaking >> > > > > > > > > > > change. 
>> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > If Spark is going to keep writing INT96 >> > > timestamps >> > > > > > > > > > indefinitely, >> > > > > > > > > > > I >> > > > > > > > > > > > > > > suggest >> > > > > > > > > > > > > > > > we un-deprecate the INT96 timestamps to >> reflect >> > > the >> > > > > > > > ecosystem >> > > > > > > > > > > > reality >> > > > > > > > > > > > > > > that >> > > > > > > > > > > > > > > > they will be here for a while rather than >> > > > pretending >> > > > > > they >> > > > > > > > are >> > > > > > > > > > > > really >> > > > > > > > > > > > > > > > deprecated. >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > Andrew >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > [1]: >> > > > > https://issues.apache.org/jira/browse/SPARK-51359 >> > > > > > > > > > > > > > > > [2]: >> > > > > > > > > > > > > > >> > > > > > > > >> > > https://github.com/apache/spark/pull/50215#issuecomment-2715147840 >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > p.s. as an aside, is anyone from DataBricks >> > > pushing >> > > > > > spark >> > > > > > > > to >> > > > > > > > > > > change >> > > > > > > > > > > > > > > > timestamp type? Or will the focus be to >> improve >> > > > > INT96 >> > > > > > > > > > timestamps >> > > > > > > > > > > > > > > instead? >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > On Wed, Jun 18, 2025 at 10:50 PM Gang Wu < >> > > > > > > [email protected] >> > > > > > > > > >> > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > It seems not adding too much value to >> improve a >> > > > > > > > deprecated >> > > > > > > > > > > > feature >> > > > > > > > > > > > > > > > > especially >> > > > > > > > > > > > > > > > > when there are abundant Parquet >> implementations >> > > > in >> > > > > > the >> > > > > > > > wild. 
>> > > > > > > > > > > > IIRC, >> > > > > > > > > > > > > > > > > parquet-java >> > > > > > > > > > > > > > > > > is planning to release 1.16.0 for new data >> > > types >> > > > > like >> > > > > > > > variant >> > > > > > > > > > > and >> > > > > > > > > > > > > > > geometry. >> > > > > > > > > > > > > > > > > It is >> > > > > > > > > > > > > > > > > also the last version to support Java 8. >> All >> > > > > > deprecated >> > > > > > > > APIs >> > > > > > > > > > > > might >> > > > > > > > > > > > > > get >> > > > > > > > > > > > > > > > > removed >> > > > > > > > > > > > > > > > > from 2.0.0 so I'm not sure if older Spark >> > > > versions >> > > > > > are >> > > > > > > > able >> > > > > > > > > > to >> > > > > > > > > > > > > > > leverage the >> > > > > > > > > > > > > > > > > int96 >> > > > > > > > > > > > > > > > > stats. The right way to go is to push >> forward >> > > the >> > > > > > > > adoption of >> > > > > > > > > > > > > > timestamp >> > > > > > > > > > > > > > > > > logical >> > > > > > > > > > > > > > > > > types. >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > Best, >> > > > > > > > > > > > > > > > > Gang >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > On Thu, Jun 19, 2025 at 12:31 AM Micah >> > > Kornfield >> > > > < >> > > > > > > > > > > > > > > [email protected]> >> > > > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > Hi Alkis, >> > > > > > > > > > > > > > > > > > Is this the right thread link? 
It >> seems to >> > > be >> > > > a >> > > > > > > > discussion >> > > > > > > > > > > on >> > > > > > > > > > > > > > > Timestamp >> > > > > > > > > > > > > > > > > > Nano support (which IIUC won't use >> int96, but >> > > > I'm >> > > > > > not >> > > > > > > > sure >> > > > > > > > > > > this >> > > > > > > > > > > > > > > covers >> > > > > > > > > > > > > > > > > > changing the behavior for existing >> > > timestamps, >> > > > > > which >> > > > > > > I >> > > > > > > > > > think >> > > > > > > > > > > > are >> > > > > > > > > > > > > at >> > > > > > > > > > > > > > > > > either >> > > > > > > > > > > > > > > > > > millisecond or microsecond granularity)? >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > there will be customers that want to >> > > interface >> > > > > with >> > > > > > > > legacy >> > > > > > > > > > > > > systems >> > > > > > > > > > > > > > > > > > > with INT96. This is why we decided in >> doing >> > > > > both. >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > It might help to elaborate on the >> time-frame >> > > > > here. >> > > > > > > > Since >> > > > > > > > > > it >> > > > > > > > > > > > > > appears >> > > > > > > > > > > > > > > > > > reference implementations of parquet >> are not >> > > > > > > currently >> > > > > > > > > > > writing >> > > > > > > > > > > > > > > > > statistics, >> > > > > > > > > > > > > > > > > > if we merge these changes when they >> will be >> > > > > picked >> > > > > > up >> > > > > > > > in >> > > > > > > > > > > Spark? >> > > > > > > > > > > > > > > Would the >> > > > > > > > > > > > > > > > > > plan be to backport the parquet-java to >> older >> > > > > > version >> > > > > > > > of >> > > > > > > > > > > Spark >> > > > > > > > > > > > > > > (otherwise >> > > > > > > > > > > > > > > > > > the legacy systems wouldn't really make >> use >> > > or >> > > > > emit >> > > > > > > > stats >> > > > > > > > > > > > > anyways)? 
>> > > > > > > > > > > > > > > What >> > > > > > > > > > > > > > > > > > is the delta between Spark picking up >> these >> > > > > changes >> > > > > > > and >> > > > > > > > > > > > > > > transitioning off >> > > > > > > > > > > > > > > > > > of Int96 by default? Is the >> expectation >> > > that >> > > > > even >> > > > > > > > once >> > > > > > > > > > the >> > > > > > > > > > > > > > default >> > > > > > > > > > > > > > > is >> > > > > > > > > > > > > > > > > > changed in spark to not use int96, >> there will >> > > > be >> > > > > a >> > > > > > > > large >> > > > > > > > > > > number >> > > > > > > > > > > > > of >> > > > > > > > > > > > > > > users >> > > > > > > > > > > > > > > > > > that will override the default to write >> > > int96? >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > Thanks, >> > > > > > > > > > > > > > > > > > Micah >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > On Wed, Jun 18, 2025 at 1:35 AM Alkis >> > > > Evlogimenos >> > > > > > > > > > > > > > > > > > >> <[email protected]> >> > > > > wrote: >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > We are also driving that in parallel: >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > >> https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of >> > > > > > > > > > > > > > . >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > Even when Spark defaults to INT64 >> there >> > > will >> > > > be >> > > > > > old >> > > > > > > > > > > versions >> > > > > > > > > > > > of >> > > > > > > > > > > > > > > Spark >> > > > > > > > > > > > > > > > > > > running, there will be customers that >> want >> > > to >> > > > > > > > interface >> > > > > > > > > > > with >> > > > > > > > > > > > > > legacy >> > > > > > > > > > > > > > > > > > systems >> > > > > > > > > > > > > > > > > > > with INT96. This is why we decided in >> doing >> > > > > both. 
>> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > On Wed, Jun 18, 2025 at 9:53 AM >> Antoine >> > > > Pitrou >> > > > > < >> > > > > > > > > > > > > > [email protected] >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Can we get Spark to stop emitting >> INT96? >> > > > They >> > > > > > are >> > > > > > > > not >> > > > > > > > > > > being >> > > > > > > > > > > > > an >> > > > > > > > > > > > > > > > > > > > extremely good community player >> here. >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Regards >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > Antoine. >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > On Fri, 13 Jun 2025 15:17:51 +0200 >> > > > > > > > > > > > > > > > > > > > Alkis Evlogimenos >> > > > > > > > > > > > > > > > > > > > >> <[email protected] >> > > > >> > > > > > > > > > > > > > > > > > > > wrote: >> > > > > > > > > > > > > > > > > > > > > Hi folks, >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > While INT96 is now deprecated, >> it's >> > > still >> > > > > the >> > > > > > > > default >> > > > > > > > > > > > > > timestamp >> > > > > > > > > > > > > > > > > type >> > > > > > > > > > > > > > > > > > in >> > > > > > > > > > > > > > > > > > > > > Spark, resulting in a significant >> > > amount >> > > > of >> > > > > > > > existing >> > > > > > > > > > > data >> > > > > > > > > > > > > > > written >> > > > > > > > > > > > > > > > > in >> > > > > > > > > > > > > > > > > > > this >> > > > > > > > > > > > > > > > > > > > > format. 
>> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Historically, parquet-mr/java has >> not >> > > > > emitted >> > > > > > > or >> > > > > > > > read >> > > > > > > > > > > > > > > statistics >> > > > > > > > > > > > > > > > > for >> > > > > > > > > > > > > > > > > > > > INT96. >> > > > > > > > > > > > > > > > > > > > > This was likely due to the fact >> that >> > > > > standard >> > > > > > > > byte >> > > > > > > > > > > > > comparison >> > > > > > > > > > > > > > > on >> > > > > > > > > > > > > > > > > the >> > > > > > > > > > > > > > > > > > > > INT96 >> > > > > > > > > > > > > > > > > > > > > representation doesn't align with >> > > logical >> > > > > > > > > > comparisons, >> > > > > > > > > > > > > > > potentially >> > > > > > > > > > > > > > > > > > > > leading >> > > > > > > > > > > > > > > > > > > > > to incorrect min/max values. This >> is >> > > > > > > unfortunate >> > > > > > > > > > > because >> > > > > > > > > > > > > > > timestamp >> > > > > > > > > > > > > > > > > > > > filters >> > > > > > > > > > > > > > > > > > > > > are extremely common and lack of >> stats >> > > > > limits >> > > > > > > > > > > > optimization >> > > > > > > > > > > > > > > > > > > opportunities. >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Since its inception Photon < >> > > > > > > > > > > > > > > > > > >> https://www.databricks.com/product/photon> >> > > > > > > > > > > > > > > > > > > > emitted >> > > > > > > > > > > > > > > > > > > > > and utilized INT96 statistics by >> > > > employing >> > > > > a >> > > > > > > > logical >> > > > > > > > > > > > > > > comparator, >> > > > > > > > > > > > > > > > > > > ensuring >> > > > > > > > > > > > > > > > > > > > > their correctness. 
We have now >> > > > implemented >> > > > > > > > > > > > > > > > > > > > > < >> > > > > > > > https://github.com/apache/parquet-java/pull/3243> >> > > > > > > > > > the >> > > > > > > > > > > > > same >> > > > > > > > > > > > > > > > > support >> > > > > > > > > > > > > > > > > > > > within >> > > > > > > > > > > > > > > > > > > > > parquet-java. >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > We'd like to get the community's >> > > thoughts >> > > > > on >> > > > > > > this >> > > > > > > > > > > > addition. >> > > > > > > > > > > > > > We >> > > > > > > > > > > > > > > > > > > anticipate >> > > > > > > > > > > > > > > > > > > > > that most users may not be >> directly >> > > > > affected >> > > > > > > due >> > > > > > > > to >> > > > > > > > > > the >> > > > > > > > > > > > > > > declining >> > > > > > > > > > > > > > > > > use >> > > > > > > > > > > > > > > > > > > of >> > > > > > > > > > > > > > > > > > > > > INT96. However, we are interested >> in >> > > > > > > identifying >> > > > > > > > any >> > > > > > > > > > > > > > potential >> > > > > > > > > > > > > > > > > > > drawbacks >> > > > > > > > > > > > > > > > > > > > or >> > > > > > > > > > > > > > > > > > > > > unforeseen issues with this >> approach. >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > Cheers >> > > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> >
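For readers skimming the thread: the original message says that standard byte comparison on the INT96 representation does not line up with logical (timestamp) comparison, which is why min/max stats computed bytewise can be wrong. The following is only a minimal sketch of that mismatch, not code from parquet-java, arrow-rs, or Photon. It assumes the conventional INT96 timestamp layout (8 bytes of nanoseconds within the day followed by 4 bytes of Julian day, both little-endian); the helper names `encode_int96`, `timestamp_key`, and `min_max` are hypothetical and exist only for illustration.

```python
import struct


def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp into 12 bytes.

    Assumed layout: int64 nanoseconds-of-day, then int32 Julian day,
    both little-endian, with no padding.
    """
    return struct.pack("<qi", nanos_of_day, julian_day)


def timestamp_key(raw: bytes) -> tuple:
    """Sort key for timestamp ordering: day first, then nanos within the day."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


def min_max(values, key=None):
    """Return (min, max) of the values under the given ordering."""
    return min(values, key=key), max(values, key=key)


# Two timestamps on consecutive (arbitrary) days.
earlier = encode_int96(2460311, 1)  # 1 ns past midnight on day D
later = encode_int96(2460312, 0)    # exactly midnight on day D + 1
values = [earlier, later]

# Plain bytes comparison is lexicographic, so it looks at the low-order
# byte of nanos-of-day first and never gets to the day field here.
bytewise = min_max(values)

# Timestamp ordering compares the Julian day before the nanos.
logical = min_max(values, key=timestamp_key)

print("bytewise min is the earlier timestamp:", bytewise[0] == earlier)
print("logical min is the earlier timestamp: ", logical[0] == earlier)
```

With these two values the bytewise minimum is the later timestamp, so a reader pruning on such stats could skip data it should have read. A writer that computes stats with something like `timestamp_key` avoids this, and the new sort order discussed above is how it would signal that to readers.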
