[GitHub] [parquet-format] JFinis commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

via GitHub Thu, 23 Mar 2023 03:57:36 -0700


JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1145999358



##########
src/main/thrift/parquet.thrift:
##########
@@ -952,6 +961,9 @@ struct ColumnIndex {
    * Such more compact values must still be valid values within the column's
    * logical type. Readers must make sure that list entries are populated 
before
    * using them by inspecting null_pages.
+   * For columns of type FLOAT and DOUBLE, NaN values are not to be included

Review Comment:
   Don’t we then have the same problem already for the NaN values stored in the 
actual columns? We do already serialize NaN to binary values in the columns 
themselves. There we also do not mandate a specific bit pattern. The spec does 
define float double to be IEEE compliant:
   ```
      * FLOAT - 4 bytes per value.  IEEE. Stored as little-endian.
      * DOUBLE - 8 bytes per value.  IEEE. Stored as little-endian.
   ```
   So if I see it correctly, any conforming implementation has to be able to 
handle all NaN bit patterns that IEEE allows. Otherwise they could not read the 
actual data in the columns.
   
   As you mention Java: Java has a defined way of reading IEEE bits into Java 
floats: `Float.intBitsToFloat` (and the respective method for double). Here it 
is guaranteed that all valid NaN bit patterns produce a Java Nan. From [the 
documentation](https://docs.oracle.com/javase/7/docs/api/java/lang/Float.html):
   
   > If the argument is any value in the range 0x7f800001 through 0x7fffffff or 
in the range 0xff800001 through 0xffffffff, the result is a NaN.
   
   This method is used by parquet-mr, so we should be fine here.
   
   So, to generalize, as I see it, the following holds:
   * Parquet defines FLOAT/DOUBLE to be IEEE without further mandating any bit 
patterns.
     * If a reader cannot handle all NaN bit patterns, they are not conforming 
to the spec.
     * Also, such reader would already today malfunction, as there can be NaNs 
with any bit patterns in columns already.
   * All prominent programming languages (C++, Java, Python, Go, ...) have IEEE 
compliant binary to float conversions, so this also sounds like a rather 
theoretical problem.
   
   I can also change my suggestion to not write NaNs, but that comes with its 
own challenges, as layed out in my commit message. But I first want to get a 
common understanding of why NaN bit patterns in bounds are worse than NaN bit 
patterns in the columns themselves. Maybe I'm missing something here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [parquet-format] JFinis commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

Reply via email to