Re: [PR] GH-7686: [Parquet] Fix int96 min/max stats [arrow-rs]

via GitHub Fri, 20 Jun 2025 08:43:49 -0700


etseidl commented on PR #7687:
URL: https://github.com/apache/arrow-rs/pull/7687#issuecomment-2992102879

I think @emkornfield is asking whether parquet-rs will ignore `INT96` statistics
on read and not write them. The sort order is undefined, so I think _any_
behavior is ok per the spec, so long as it's consistent. But, as with many
things, I think there's an inconsistency between the arrow and record APIs. It
seems the arrow API will refuse to write `INT96`:
https://github.com/apache/arrow-rs/blob/1bed04c1e053e52575c6476f592c5aca3de7310f/parquet/src/arrow/arrow_writer/mod.rs#L1141-L1143

AFAICT the record API _will_, and the statistics written will be ordered as
`Vec<u32>`, which is not what's desired here. (see
[this](https://github.com/apache/arrow-rs/blob/1bed04c1e053e52575c6476f592c5aca3de7310f/parquet/src/column/writer/mod.rs#L2517)
test for instance).
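
To make the ordering problem concrete, here is a minimal sketch. It assumes the Spark/Impala-style layout of three little-endian `u32` words (two words of nanoseconds of day, then the Julian day) and a word-by-word comparison starting from the first word. The helper `chronological_cmp` and the bare `[u32; 3]` arrays are illustrative only and are not parquet-rs APIs.

```rust
use std::cmp::Ordering;

/// Hypothetical helper: compare two INT96 timestamps chronologically.
/// Assumes words [0] and [1] are the low/high halves of nanos-of-day
/// and word [2] is the Julian day.
fn chronological_cmp(a: &[u32; 3], b: &[u32; 3]) -> Ordering {
    let key = |v: &[u32; 3]| (v[2], ((v[1] as u64) << 32) | v[0] as u64);
    key(a).cmp(&key(b))
}

fn main() {
    // Earlier day, 4 seconds after midnight (large low nanos word).
    let earlier = [4_000_000_000u32, 0, 2_460_000];
    // Later day, 1 nanosecond after midnight.
    let later = [1u32, 0, 2_460_001];

    // Word-by-word comparison looks at the low nanos word first,
    // so it ranks the earlier timestamp as the larger one.
    println!("word order:    {:?}", earlier.cmp(&later));
    println!("chronological: {:?}", chronological_cmp(&earlier, &later));
}
```

Under these assumptions the two comparisons disagree, which is the case where min/max statistics built from the word order would be misleading.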

On read I believe both will treat statistics "properly" (i.e. the `Int96`
type will be interpreted as little endian int96, with the 8 byte nanos in the
low-order bytes and the 4 byte Julian day in the high-order bytes), but the
arrow API will promptly cast to some type of timestamp or error.
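
As a rough illustration of that interpretation, the sketch below decodes a 12-byte value. It assumes the Spark/Impala convention, with nanoseconds of day in the first 8 bytes and the Julian day in the last 4, both little endian. `decode_int96` and `to_unix_nanos` are hypothetical helpers written for this example, not functions from the parquet crate.

```rust
/// Julian day number of 1970-01-01.
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const NANOS_PER_DAY: i64 = 86_400_000_000_000;

/// Hypothetical helper: split a 12-byte INT96 value into
/// (julian_day, nanos_of_day), assuming nanos come first.
fn decode_int96(bytes: [u8; 12]) -> (u32, u64) {
    let mut nanos = [0u8; 8];
    nanos.copy_from_slice(&bytes[..8]);
    let mut day = [0u8; 4];
    day.copy_from_slice(&bytes[8..]);
    (u32::from_le_bytes(day), u64::from_le_bytes(nanos))
}

/// Hypothetical helper: convert the decoded pair to nanoseconds
/// since the Unix epoch.
fn to_unix_nanos(day: u32, nanos: u64) -> i64 {
    (day as i64 - JULIAN_DAY_OF_EPOCH) * NANOS_PER_DAY + nanos as i64
}

fn main() {
    let mut raw = [0u8; 12];
    raw[..8].copy_from_slice(&1_000_000_000u64.to_le_bytes()); // 1 s after midnight
    raw[8..].copy_from_slice(&2_440_588u32.to_le_bytes()); // 1970-01-01
    let (day, nanos) = decode_int96(raw);
    println!("day={day} nanos={nanos} unix_nanos={}", to_unix_nanos(day, nanos));
}
```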

In the short term it might be best to have this crate mimic parquet-java and
ignore `INT96` statistics if present and refuse to write them at all. We can
revisit this PR if the community comes to a consensus and un-deprecates the
type, or at least standardizes it rather than relying on Spark's or Impala's
implementation.

