JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1237384626


##########
src/main/thrift/parquet.thrift:
##########
@@ -886,16 +891,25 @@ union ColumnOrder {
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
    * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
-   *     compatibility rules should be applied when reading statistics:
+   *     point values (relations vs. total ordering), the following compatibility
+   *     rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
+   *     - If the nan_count field is set, a reader can compute
+   *       nan_count + null_count == num_values to deduce whether all non-NULL
+   *       values are NaN.
+   *     - When looking for NaN values, min and max should be ignored.
+   *       If the nan_count field is set, it can be used to check whether
+   *       NaNs are present.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
-   *     - When looking for NaN values, min and max should be ignored.
    * 
    *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     - It is suggested to always set the nan_count fields for FLOAT and
+           DOUBLE columns.
+   *     - NaNs should not be written to min or max statistics fields except
+   *       in the column index, where a value has to be written incase of

Review Comment:
   I don't fully understand your question.
   
   We have to write nan_pages and nan_counts, *and* we also have to write NaN 
values to the actual min and max in the column index. The bounds must hold a 
valid double value, and NaN is the only correct double value when all values 
in a page are NaN, as pointed out by @gszadovszky 
[here](https://github.com/apache/parquet-format/pull/196#issuecomment-1491890773).
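
To illustrate the rule, here is a minimal Python sketch of how a writer might derive per-page bounds and how a reader might apply the `nan_count + null_count == num_values` check from the diff. `PageStats`, `page_stats`, and `all_non_null_are_nan` are hypothetical names made up for this example and are not part of any Parquet library. The sketch also assumes that all-null pages carry no bounds and are left to the existing `null_pages` mechanism.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class PageStats:
    # Hypothetical container for illustration; not a real Parquet type.
    min_value: Optional[float]
    max_value: Optional[float]
    null_count: int
    nan_count: int
    num_values: int


def page_stats(values: Sequence[Optional[float]]) -> PageStats:
    """Compute column-index style bounds for one page of DOUBLE values."""
    non_null = [v for v in values if v is not None]
    non_nan = [v for v in non_null if not math.isnan(v)]
    null_count = len(values) - len(non_null)
    nan_count = len(non_null) - len(non_nan)

    if non_nan:
        # Normal case: NaNs never take part in min/max.
        lo, hi = min(non_nan), max(non_nan)
    elif non_null:
        # Every non-null value is NaN: a double still has to be written
        # to the bounds, and NaN is the only correct choice.
        lo = hi = math.nan
    else:
        # Assumption: an all-null page has no bounds at all.
        lo = hi = None
    return PageStats(lo, hi, null_count, nan_count, len(values))


def all_non_null_are_nan(stats: PageStats) -> bool:
    """Reader-side check taken from the proposed spec text."""
    return stats.nan_count + stats.null_count == stats.num_values


only_nan = page_stats([math.nan, None, math.nan])
mixed = page_stats([1.0, math.nan, -2.0])
```

In this sketch `only_nan` ends up with NaN in both bounds and passes the reader-side check, while `mixed` keeps ordinary bounds computed from its non-NaN values, so a reader can tell the two cases apart using the counts alone.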


