etseidl commented on PR #7687:
URL: https://github.com/apache/arrow-rs/pull/7687#issuecomment-2992102879

   I think @emkornfield is asking whether parquet-rs will ignore `INT96` 
statistics on read and not write them. The sort order is undefined, so I think 
_any_ behavior, so long as it's consistent, is ok per the spec. But, as with 
many things, I think there's an inconsistency between the arrow and record 
APIs. It seems the arrow API will refuse to write `INT96`: 
https://github.com/apache/arrow-rs/blob/1bed04c1e053e52575c6476f592c5aca3de7310f/parquet/src/arrow/arrow_writer/mod.rs#L1141-L1143
   
   AFAICT the record API _will_, and the statistics written will be ordered as 
`Vec<u32>`, which is not what's desired here. (see 
[this](https://github.com/apache/arrow-rs/blob/1bed04c1e053e52575c6476f592c5aca3de7310f/parquet/src/column/writer/mod.rs#L2517)
 test for instance).
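
   To illustrate why word-wise ordering is a problem, here is a minimal sketch. It assumes an `INT96` value is held as three `u32` words in on-disk order (low nanos, high nanos, Julian day); the `Int96Words` alias and both function names are made up for this example and are not parquet-rs APIs.

```rust
use std::cmp::Ordering;

/// Hypothetical stand-in for an INT96 value: three u32 words in on-disk
/// order, assumed here to be [nanos_lo, nanos_hi, julian_day].
type Int96Words = [u32; 3];

/// Word-by-word (lexicographic) comparison, i.e. the ordering you get
/// when the value is treated like a `Vec<u32>`.
fn cmp_as_words(a: &Int96Words, b: &Int96Words) -> Ordering {
    a.cmp(b)
}

/// Chronological comparison: Julian day first, then nanoseconds of day.
fn cmp_chronological(a: &Int96Words, b: &Int96Words) -> Ordering {
    let key = |w: &Int96Words| (w[2], (u64::from(w[1]) << 32) | u64::from(w[0]));
    key(a).cmp(&key(b))
}
```

   With `earlier = [5, 0, 2_460_000]` and `later = [1, 0, 2_460_001]`, the word-wise comparison ranks `earlier` above `later` because it looks at the low nanos word first, while the chronological comparison ranks it below. Min/max statistics computed the first way would therefore not bound the timestamps.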
   
   On read I believe both will treat statistics "properly" (i.e. the `Int96` 
type will be interpreted as little-endian int96, with 4-byte days followed by 
8-byte nanos), but the arrow API will promptly cast to some type of timestamp 
or return an error.
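
   As a rough sketch of that interpretation: read as a little-endian 96-bit integer, the days are the most significant 4 bytes, so I'm assuming they sit in the last 4 bytes on disk, after the 8 bytes of nanoseconds. The function names below are hypothetical and only use the standard library; this is not the parquet-rs implementation.

```rust
/// Julian day number of 1970-01-01.
const JULIAN_DAY_OF_UNIX_EPOCH: i64 = 2_440_588;
const NANOS_PER_DAY: i64 = 86_400 * 1_000_000_000;

/// Split 12 raw bytes into (julian_day, nanos_of_day).
/// Assumed layout: bytes 0..8 are nanoseconds of day, bytes 8..12 are the
/// Julian day, both little-endian.
fn decode_int96(bytes: &[u8; 12]) -> (u32, u64) {
    let mut nanos = [0u8; 8];
    nanos.copy_from_slice(&bytes[..8]);
    let mut day = [0u8; 4];
    day.copy_from_slice(&bytes[8..]);
    (u32::from_le_bytes(day), u64::from_le_bytes(nanos))
}

/// Convert to nanoseconds since the Unix epoch, or `None` if the result
/// does not fit in an i64.
fn to_unix_nanos(julian_day: u32, nanos_of_day: u64) -> Option<i64> {
    if nanos_of_day > i64::MAX as u64 {
        return None;
    }
    let days = i64::from(julian_day) - JULIAN_DAY_OF_UNIX_EPOCH;
    days.checked_mul(NANOS_PER_DAY)?
        .checked_add(nanos_of_day as i64)
}
```

   The `Option` return mirrors the "cast or error" behaviour described above: values far outside the representable timestamp range come back as `None` rather than silently wrapping.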
   
   In the short term it might be best to have this crate mimic parquet-java and 
ignore `INT96` statistics if present and refuse to write them at all. We can 
revisit this PR if the community comes to a consensus and un-deprecates the 
type, or at least standardizes it rather than relying on Spark's or Impala's 
implementation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
