I have prepared a doc <https://docs.google.com/document/d/1Ox0qHYBgs_3-pNqn9V8zVQm_W6qP0lsbd2XwQnQVz1Y/edit?tab=t.0> that summarizes the discussion and collects all the relevant links in one place.
On Wed, Jun 25, 2025 at 1:32 PM Alkis Evlogimenos <alkis.evlogime...@databricks.com.invalid> wrote:

> Spark needs to start writing INT64 nanos first to be able to replace INT96,
> which is in nanos, if data is at nano granularity. This is why I linked that
> ticket, which is a prerequisite to switching to INT64 in many cases.
>
> I understand the concerns around changing a deprecated aspect of the
> parquet spec. The reason we decided to bring this forward is because:
> 1. there are a lot of parquet files with the right INT96 stats out there
>    (Photon has been writing them for years)
> 2. all engines ignore the INT96 stats, so Photon writing them didn't break
>    anyone
> 3. Spark is (slowly) moving away from INT96
> 4. our change is very narrow, backwards compatible, and can improve current
>    workloads while (3) is ongoing
>
> Let's discuss more at the sync tonight.
>
> > If we are going to standardize an ordering for INT96, rather than parsing
> > "created_by" fields, wouldn't it make more sense to add a new ColumnOrder
> > value (like what's proposed for PARQUET-2249 [1])? Then we don't need to
> > maintain a list of known good writers.
>
> We do not have to add another ColumnOrder value since INT96 is a *physical*
> type and can only take timestamps in the specified format. This was
> arguably a design wart, as it should have been a FIXED_LEN_BYTE_ARRAY(12)
> with logical type INT96_TIMESTAMP, for which a different ColumnOrder would
> make sense. In this case we are lucky this is a physical type without a
> logical type attached because otherwise we couldn't have made this change
> in a backwards compatible way as easily.
>
> On Sat, Jun 21, 2025 at 12:57 AM Ed Seidl <etse...@apache.org> wrote:
>
> > If we are going to standardize an ordering for INT96, rather than parsing
> > "created_by" fields, wouldn't it make more sense to add a new ColumnOrder
> > value (like what's proposed for PARQUET-2249 [1])? Then we don't need to
> > maintain a list of known good writers.
> >
> > Ed
> >
> > [1] https://github.com/apache/parquet-format/pull/221
> >
> > On 2025/06/19 10:15:13 Andrew Lamb wrote:
> > > > While INT96 is now deprecated, it's still the default timestamp type in
> > > > Spark, resulting in a significant amount of existing data written in this
> > > > format.
> > >
> > > I agree with Gang and Antoine that the better solution is to change Spark
> > > to write non-deprecated parquet data types.
> > >
> > > It seems there is an issue in the Spark JIRA to do this [1], but the only
> > > feedback on the associated PR [2] is that it is a breaking change.
> > >
> > > If Spark is going to keep writing INT96 timestamps indefinitely, I suggest
> > > we un-deprecate the INT96 timestamps to reflect the ecosystem reality that
> > > they will be here for a while rather than pretending they are really
> > > deprecated.
> > >
> > > Andrew
> > >
> > > [1]: https://issues.apache.org/jira/browse/SPARK-51359
> > > [2]: https://github.com/apache/spark/pull/50215#issuecomment-2715147840
> > >
> > > p.s. As an aside, is anyone from Databricks pushing Spark to change
> > > timestamp type? Or will the focus be to improve INT96 timestamps instead?
> > >
> > > On Wed, Jun 18, 2025 at 10:50 PM Gang Wu <ust...@gmail.com> wrote:
> > >
> > > > It does not seem to add much value to improve a deprecated feature,
> > > > especially when there are abundant Parquet implementations in the wild.
> > > > IIRC, parquet-java is planning to release 1.16.0 for new data types like
> > > > variant and geometry. It is also the last version to support Java 8. All
> > > > deprecated APIs might get removed from 2.0.0, so I'm not sure if older
> > > > Spark versions are able to leverage the int96 stats. The right way to go
> > > > is to push forward the adoption of timestamp logical types.
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Thu, Jun 19, 2025 at 12:31 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > > >
> > > > > Hi Alkis,
> > > > > Is this the right thread link? It seems to be a discussion on Timestamp
> > > > > Nano support (which IIUC won't use int96, but I'm not sure this covers
> > > > > changing the behavior for existing timestamps, which I think are at
> > > > > either millisecond or microsecond granularity)?
> > > > >
> > > > > > there will be customers that want to interface with legacy systems
> > > > > > with INT96. This is why we decided in doing both.
> > > > >
> > > > > It might help to elaborate on the time-frame here. Since it appears
> > > > > reference implementations of parquet are not currently writing
> > > > > statistics, if we merge these changes, when will they be picked up in
> > > > > Spark? Would the plan be to backport the parquet-java to older versions
> > > > > of Spark (otherwise the legacy systems wouldn't really make use of or
> > > > > emit stats anyways)? What is the delta between Spark picking up these
> > > > > changes and transitioning off of Int96 by default? Is the expectation
> > > > > that even once the default is changed in Spark to not use int96, there
> > > > > will be a large number of users that will override the default to write
> > > > > int96?
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > On Wed, Jun 18, 2025 at 1:35 AM Alkis Evlogimenos
> > > > > <alkis.evlogime...@databricks.com.invalid> wrote:
> > > > >
> > > > > > We are also driving that in parallel:
> > > > > > https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of
> > > > > >
> > > > > > Even when Spark defaults to INT64, there will be old versions of Spark
> > > > > > running, and there will be customers that want to interface with legacy
> > > > > > systems with INT96. This is why we decided on doing both.
> > > > > >
> > > > > > On Wed, Jun 18, 2025 at 9:53 AM Antoine Pitrou <anto...@python.org> wrote:
> > > > > >
> > > > > > > Can we get Spark to stop emitting INT96? They are not being an
> > > > > > > extremely good community player here.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > > On Fri, 13 Jun 2025 15:17:51 +0200
> > > > > > > Alkis Evlogimenos
> > > > > > > <alkis.evlogime...@databricks.com.INVALID>
> > > > > > > wrote:
> > > > > > > > Hi folks,
> > > > > > > >
> > > > > > > > While INT96 is now deprecated, it's still the default timestamp
> > > > > > > > type in Spark, resulting in a significant amount of existing data
> > > > > > > > written in this format.
> > > > > > > >
> > > > > > > > Historically, parquet-mr/java has not emitted or read statistics
> > > > > > > > for INT96. This was likely due to the fact that standard byte
> > > > > > > > comparison on the INT96 representation doesn't align with logical
> > > > > > > > comparisons, potentially leading to incorrect min/max values. This
> > > > > > > > is unfortunate because timestamp filters are extremely common and
> > > > > > > > lack of stats limits optimization opportunities.
> > > > > > > >
> > > > > > > > Since its inception, Photon
> > > > > > > > <https://www.databricks.com/product/photon> has emitted and utilized
> > > > > > > > INT96 statistics by employing a logical comparator, ensuring their
> > > > > > > > correctness. We have now implemented
> > > > > > > > <https://github.com/apache/parquet-java/pull/3243> the same support
> > > > > > > > within parquet-java.
> > > > > > > >
> > > > > > > > We'd like to get the community's thoughts on this addition.
> > > > > > > > We anticipate that most users may not be directly affected due to
> > > > > > > > the declining use of INT96. However, we are interested in
> > > > > > > > identifying any potential drawbacks or unforeseen issues with this
> > > > > > > > approach.
> > > > > > > >
> > > > > > > > Cheers
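
For readers following the thread, here is a minimal sketch of why byte-wise min/max can be wrong for INT96 and what a logical comparator does instead. It assumes the commonly used INT96 timestamp layout (8 little-endian bytes of nanoseconds within the day, followed by 4 little-endian bytes of Julian day number). The helper names `encode_int96`, `int96_sort_key`, and `logical_min_max` are hypothetical and are not taken from parquet-java or Photon.

```python
import struct


def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp into 12 bytes: nanos-of-day (int64 LE), then Julian day (int32 LE).

    Assumption: this mirrors the layout conventionally used for INT96 timestamps.
    """
    return struct.pack("<qi", nanos_of_day, julian_day)


def int96_sort_key(raw: bytes):
    """Decode the 12 bytes and order by (day, nanos), i.e. chronological order."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


def logical_min_max(values):
    """Compute min/max statistics with the logical comparator."""
    return min(values, key=int96_sort_key), max(values, key=int96_sort_key)


# Two illustrative values: `earlier` falls on the previous day but has a
# larger nanos-of-day component than `later`.
later = encode_int96(julian_day=2460840, nanos_of_day=1)
earlier = encode_int96(julian_day=2460839, nanos_of_day=2)

# Plain byte comparison looks at the nanos bytes first, so it picks the
# chronologically later value as the minimum.
bytewise_min = min([later, earlier])

# The logical comparator compares the day first, then the nanos.
logical_min, logical_max = logical_min_max([later, earlier])
```

In this sketch `bytewise_min` is the chronologically later value, which is the kind of incorrect statistic the original message describes, while `logical_min_max` returns the values in true chronological order.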