alamb opened a new issue, #8156:
URL: https://github.com/apache/arrow-rs/issues/8156

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   - Related to https://github.com/apache/parquet-format/issues/406
   
   @JFinis has been working on a proposal to better store statistics for 
floating point values in Parquet. The most recent proposal is here:
   - https://github.com/apache/parquet-format/pull/514
   
   Changing the format requires at least two open source implementations 
of a proposal.
   
   There is also some question (see this link from @tustvold) about how 
complex this would be to implement and get right.
   
   **Describe the solution you'd like**
   
   I would like to implement a draft of the specification in 
https://github.com/apache/parquet-format/pull/514 in arrow-rs, both to show 
that it is possible and to keep the Rust implementation on the leading edge.
   
   **Describe alternatives you've considered**
   - @etseidl  has implemented the IEEE 754 total order in a draft PR here: 
https://github.com/apache/arrow-rs/pull/7408
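
For readers unfamiliar with the IEEE 754 total order, here is a minimal sketch using the standard library's `f64::total_cmp`. It is not taken from the draft PR; the helper names `pos_nan`, `neg_nan`, and `sort_total_order` are hypothetical and exist only for illustration.

```rust
/// A quiet NaN with the sign bit clear (hypothetical helper for illustration).
fn pos_nan() -> f64 {
    f64::from_bits(0x7ff8_0000_0000_0000)
}

/// A quiet NaN with the sign bit set (hypothetical helper for illustration).
fn neg_nan() -> f64 {
    f64::from_bits(0xfff8_0000_0000_0000)
}

/// Sorts values using the IEEE 754 totalOrder predicate, as exposed by
/// `f64::total_cmp`. Unlike `partial_cmp`, this never fails on NaN:
/// -NaN sorts before -Inf, -0.0 sorts before +0.0, and +NaN sorts after +Inf.
fn sort_total_order(values: &mut [f64]) {
    values.sort_by(|a, b| a.total_cmp(b));
}
```

Sorting a slice that contains `neg_nan()`, `-Inf`, `-0.0`, `0.0`, `Inf`, and `pos_nan()` with `sort_total_order` places them in exactly that order, which is why NaNs would land in min/max under a total order unless they are filtered out or tracked separately.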
   
   We would also need to implement the `nan_count` field, along with 
filtering out NaNs when writing statistics for floats.
   
   Some good tests would be to
   1. Write floating point data (the cases listed below) to a Parquet file
   2. Read the metadata back and verify the min/max values and `nan_count` 
for each of the following cases:
      - A column with no NaN values
      - A column with a single +NaN value (should not appear in stats)
      - A column with a single -NaN value (should not appear in stats)
      - A column of *only* NaN values
      - A column with Inf and some +/- NaNs
      - A column with -Inf and some +/- NaNs
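
The expected behavior for these cases can be sketched without any Parquet code. The following is a rough illustration only: `FloatStats` and `compute_float_stats` are hypothetical names, not arrow-rs or parquet APIs, and leaving min/max unset for an all-NaN column is an assumption of this sketch rather than something the proposal is known to specify.

```rust
/// Hypothetical summary of a float column; not the arrow-rs `Statistics` type.
#[derive(Debug, PartialEq)]
struct FloatStats {
    min: Option<f64>,
    max: Option<f64>,
    nan_count: u64,
}

/// Computes min/max over the non-NaN values and counts NaNs separately.
/// Assumption: a column holding only NaNs yields `None` for both min and max.
fn compute_float_stats(values: &[f64]) -> FloatStats {
    let mut stats = FloatStats { min: None, max: None, nan_count: 0 };
    for &v in values {
        if v.is_nan() {
            // NaNs of either sign are counted but never enter min/max.
            stats.nan_count += 1;
            continue;
        }
        // Compare with the IEEE 754 total order so -0.0 < +0.0 is respected.
        stats.min = Some(match stats.min {
            Some(m) if m.total_cmp(&v).is_le() => m,
            _ => v,
        });
        stats.max = Some(match stats.max {
            Some(m) if m.total_cmp(&v).is_ge() => m,
            _ => v,
        });
    }
    stats
}
```

A real test would write each column to a Parquet file and compare the statistics read back from the metadata against values computed along these lines.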
   
   
   **Additional context**
   * Original JIRA issue: https://issues.apache.org/jira/browse/PARQUET-2249
   * Mailing list discussion: 
https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s
   

