If we are going to standardize an ordering for INT96, rather than parsing "created_by" fields, wouldn't it make more sense to add a new ColumnOrder value (like what's proposed for PARQUET-2249 [1])? Then we don't need to maintain a list of known good writers.
Ed

[1] https://github.com/apache/parquet-format/pull/221

On 2025/06/19 10:15:13 Andrew Lamb wrote:
> > While INT96 is now deprecated, it's still the default timestamp type in
> > Spark, resulting in a significant amount of existing data written in this
> > format.
>
> I agree with Gang and Antoine that the better solution is to change Spark
> to write non deprecated parquet data types.
>
> It seems there is an issue in the Spark JIRA to do this[1] but the only
> feedback on the associated PR [2] is that it is a breaking change.
>
> If Spark is going to keep writing INT96 timestamps indefinitely, I suggest
> we un-deprecate the INT96 timestamps to reflect the ecosystem reality that
> they will be here for a while rather than pretending they are really
> deprecated.
>
> Andrew
>
> [1]: https://issues.apache.org/jira/browse/SPARK-51359
> [2]: https://github.com/apache/spark/pull/50215#issuecomment-2715147840
>
> p.s. as an aside, is anyone from DataBricks pushing spark to change
> timestamp type? Or will the focus be to improve INT96 timestamps instead?
>
>
> On Wed, Jun 18, 2025 at 10:50 PM Gang Wu <ust...@gmail.com> wrote:
>
> > It seems not adding too much value to improve a deprecated feature
> > especially when there are abundant Parquet implementations in the wild.
> > IIRC, parquet-java is planning to release 1.16.0 for new data types like
> > variant and geometry. It is also the last version to support Java 8. All
> > deprecated APIs might get removed from 2.0.0 so I'm not sure if older
> > Spark versions are able to leverage the int96 stats. The right way to go
> > is to push forward the adoption of timestamp logical types.
> >
> > Best,
> > Gang
> >
> > On Thu, Jun 19, 2025 at 12:31 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > Hi Alkis,
> > > Is this the right thread link? It seems to be a discussion on Timestamp
> > > Nano support (which IIUC won't use int96, but I'm not sure this covers
> > > changing the behavior for existing timestamps, which I think are at
> > > either millisecond or microsecond granularity)?
> > >
> > > > there will be customers that want to interface with legacy systems
> > > > with INT96. This is why we decided in doing both.
> > >
> > >
> > > It might help to elaborate on the time-frame here. Since it appears
> > > reference implementations of parquet are not currently writing
> > > statistics, if we merge these changes when they will be picked up in
> > > Spark? Would the plan be to backport the parquet-java to older version
> > > of Spark (otherwise the legacy systems wouldn't really make use or emit
> > > stats anyways)? What is the delta between Spark picking up these
> > > changes and transitioning off of Int96 by default? Is the expectation
> > > that even once the default is changed in spark to not use int96, there
> > > will be a large number of users that will override the default to
> > > write int96?
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Wed, Jun 18, 2025 at 1:35 AM Alkis Evlogimenos
> > > <alkis.evlogime...@databricks.com.invalid> wrote:
> > >
> > > > We are also driving that in parallel:
> > > > https://lists.apache.org/thread/y2vzrjl1499j5dvbpg3m81jxdhf4b6of.
> > > >
> > > > Even when Spark defaults to INT64 there will be old versions of Spark
> > > > running, there will be customers that want to interface with legacy
> > > > systems with INT96. This is why we decided in doing both.
> > > >
> > > > On Wed, Jun 18, 2025 at 9:53 AM Antoine Pitrou <anto...@python.org>
> > > > wrote:
> > > >
> > > > >
> > > > > Can we get Spark to stop emitting INT96? They are not being an
> > > > > extremely good community player here.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > > On Fri, 13 Jun 2025 15:17:51 +0200
> > > > > Alkis Evlogimenos
> > > > > <alkis.evlogime...@databricks.com.INVALID>
> > > > > wrote:
> > > > > > Hi folks,
> > > > > >
> > > > > > While INT96 is now deprecated, it's still the default timestamp
> > > > > > type in Spark, resulting in a significant amount of existing data
> > > > > > written in this format.
> > > > > >
> > > > > > Historically, parquet-mr/java has not emitted or read statistics
> > > > > > for INT96. This was likely due to the fact that standard byte
> > > > > > comparison on the INT96 representation doesn't align with logical
> > > > > > comparisons, potentially leading to incorrect min/max values.
> > > > > > This is unfortunate because timestamp filters are extremely
> > > > > > common and lack of stats limits optimization opportunities.
> > > > > >
> > > > > > Since its inception Photon
> > > > > > <https://www.databricks.com/product/photon> emitted and utilized
> > > > > > INT96 statistics by employing a logical comparator, ensuring
> > > > > > their correctness. We have now implemented
> > > > > > <https://github.com/apache/parquet-java/pull/3243> the same
> > > > > > support within parquet-java.
> > > > > >
> > > > > > We'd like to get the community's thoughts on this addition. We
> > > > > > anticipate that most users may not be directly affected due to
> > > > > > the declining use of INT96. However, we are interested in
> > > > > > identifying any potential drawbacks or unforeseen issues with
> > > > > > this approach.
> > > > > >
> > > > > > Cheers
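For anyone following the byte-comparison point in the original message, here is a minimal Python sketch of why raw byte order and logical order can disagree for INT96 timestamps. It assumes the commonly used INT96 timestamp layout (8 bytes of little-endian nanoseconds within the day, followed by 4 bytes of little-endian Julian day). The helper names `encode_int96` and `logical_key` are hypothetical, chosen for illustration only; they are not taken from parquet-java or the linked PR.

```python
import struct
from typing import Tuple


def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    """Pack a timestamp into 12 bytes: nanos-of-day first, then Julian day.

    Assumed layout: little-endian int64 nanoseconds, little-endian int32 day.
    """
    return struct.pack("<qi", nanos_of_day, julian_day)


def logical_key(raw: bytes) -> Tuple[int, int]:
    """Sort key that orders by day first, then by nanoseconds within the day."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)


# Two timestamps: `earlier` is one nanosecond into a day,
# `later` is exactly midnight of the following day.
earlier = encode_int96(julian_day=2_460_000, nanos_of_day=1)
later = encode_int96(julian_day=2_460_001, nanos_of_day=0)

# Plain byte comparison looks at the low-order nanosecond byte first,
# so it picks the logically later value as the minimum.
bytewise_min = min(earlier, later)

# Comparing on (day, nanos) recovers the logical minimum.
logical_min = min(earlier, later, key=logical_key)
```

In this sketch `bytewise_min` ends up being `later` while `logical_min` is `earlier`, which is the kind of mismatch that makes min/max statistics computed by byte comparison unreliable.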