Re: [DISCUSS] INT96 stats

Alkis Evlogimenos Fri, 13 Jun 2025 10:42:09 -0700

Yes, you're right that some writers might emit incorrect statistics.

That's why we included the compatibility check logic in the PR [1]. This
approach ensures that we verify the writer's capabilities before utilizing
the INT96 statistics. This is a common practice, and the C++ implementation
employs similar checks to protect against writer-related issues [2].


I believe this change is safe precisely because the sort order for INT96 is
undefined by the spec. No reader should have been relying on it. This is
further supported by the fact that Photon has been successfully writing and
utilizing INT96 statistics for years without any reported issues. If
readers were misinterpreting these statistics, we likely would have heard
about it.

Regarding the undefined order, we could actually define it in a sensible
way. Since INT96 is a base type, defining a specific order wouldn't break
existing readers. However, if INT96 were a converted or logical type with a
base type of `FIXED_LEN_BYTE_ARRAY(12)`, then defining an order could
potentially cause compatibility problems with existing readers.

[1]
https://github.com/apache/parquet-java/pull/3243/files#diff-5a411f9c4979615ddc34eb42a939d5de3c2a3cd50face8647eeaec203d1fc98bR44-R62
[2]
https://github.com/apache/arrow/blob/2ba455f17e7cdbfe2b2f1aa3dfb2bf00878a17e1/cpp/src/parquet/metadata.cc#L1543-L1581

On Fri, Jun 13, 2025 at 7:13 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Looking more closely the C++ code is quite old (not even in Arrow repo),
> and the current code [1] looks like it matches the spec
>
> [1]
>
> https://github.com/apache/arrow/blob/2ba455f17e7cdbfe2b2f1aa3dfb2bf00878a17e1/cpp/src/parquet/types.cc#L302
>
> On Fri, Jun 13, 2025 at 9:55 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hi Alkis,
> >
> > I don't think this is just an implementation detail, the spec currently
> > explicitly states int96 sort order is undefined [1].
> >
> > Despite this, a quick scan of the C++ seems to indicate it might be
> > reading/writing stats for int96 (again only looked quickly but I couldn't
> > find guards against it).  There is an old bug [2] on correcting
> comparisons
> > (I didn't look closely to see if these align with the Parquet-java
> changes
> > proposed), so there is a chance that files in the wild written by C++
> might
> > have incorrect statistics.
> >
> > Cheers,
> > Micah
> >
> > [1]
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079
> > [2] https://github.com/apache/parquet-cpp/pull/399/files
> >
> >
> > On Fri, Jun 13, 2025 at 6:24 AM Alkis Evlogimenos
> > <alkis.evlogime...@databricks.com.invalid> wrote:
> >
> >> Hi folks,
> >>
> >> While INT96 is now deprecated, it's still the default timestamp type in
> >> Spark, resulting in a significant amount of existing data written in
> this
> >> format.
> >>
> >> Historically, parquet-mr/java has not emitted or read statistics for
> >> INT96.
> >> This was likely due to the fact that standard byte comparison on the
> INT96
> >> representation doesn't align with logical comparisons, potentially
> leading
> >> to incorrect min/max values. This is unfortunate because timestamp
> filters
> >> are extremely common and lack of stats limits optimization
> opportunities.
> >>
> >> Since its inception Photon <https://www.databricks.com/product/photon>
> >> emitted
> >> and utilized INT96 statistics by employing a logical comparator,
> ensuring
> >> their correctness. We have now implemented
> >> <https://github.com/apache/parquet-java/pull/3243> the same support
> >> within
> >> parquet-java.
> >>
> >> We'd like to get the community's thoughts on this addition. We
> anticipate
> >> that most users may not be directly affected due to the declining use of
> >> INT96. However, we are interested in identifying any potential drawbacks
> >> or
> >> unforeseen issues with this approach.
> >>
> >> Cheers
> >>
> >
>

Re: [DISCUSS] INT96 stats

Reply via email to