JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1145999358
##########
src/main/thrift/parquet.thrift:
##########
@@ -952,6 +961,9 @@ struct ColumnIndex {
* Such more compact values must still be valid values within the column's
* logical type. Readers must make sure that list entries are populated
before
* using them by inspecting null_pages.
+ * For columns of type FLOAT and DOUBLE, NaN values are not to be included
Review Comment:
Don’t we then have the same problem already for the NaN values stored in the
actual columns? We do already serialize NaN to binary values in the columns
themselves. There we also do not mandate a specific bit pattern. The spec does
define float double to be IEEE compliant:
```
* FLOAT - 4 bytes per value. IEEE. Stored as little-endian.
* DOUBLE - 8 bytes per value. IEEE. Stored as little-endian.
```
So if I see it correctly, any conforming implementation has to be able to
handle all NaN bit patterns that IEEE allows. Otherwise they could not read the
actual data in the columns.
As you mention Java: Java has a defined way of reading IEEE bits into Java
floats: `Float.intBitsToFloat` (and the respective method for double). Here it
is guaranteed that all valid NaN bit patterns produce a Java Nan. From [the
documentation](https://docs.oracle.com/javase/7/docs/api/java/lang/Float.html):
> If the argument is any value in the range 0x7f800001 through 0x7fffffff or
in the range 0xff800001 through 0xffffffff, the result is a NaN.
This method is used by parquet-mr, so we should be fine here.
So, to generalize, as I see it, the following holds:
* Parquet defines FLOAT/DOUBLE to be IEEE without further mandating any bit
patterns.
* If a reader cannot handle all NaN bit patterns, they are not conforming
to the spec.
* Also, such reader would already today malfunction, as there can be NaNs
with any bit patterns in columns already.
* All prominent programming languages (C++, Java, Python, Go, ...) have IEEE
compliant binary to float conversions, so this also sounds like a rather
theoretical problem.
I can also change my suggestion to not write NaNs, but that comes with its
own challenges, as layed out in my commit message. But I first want to get a
common understanding of why NaN bit patterns in bounds are worse than NaN bit
patterns in the columns themselves. Maybe I'm missing something here.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]