etseidl commented on code in PR #514:
URL: https://github.com/apache/parquet-format/pull/514#discussion_r2967590578


##########
src/main/thrift/parquet.thrift:
##########
@@ -1199,6 +1286,17 @@ struct ColumnIndex {
    * Such more compact values must still be valid values within the column's
    * logical type. Readers must make sure that list entries are populated 
before
    * using them by inspecting null_pages.
+   *
+   * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
+   * NaN values are not to be included in these bounds. If all non-null values
+   * of a page are NaN, then a writer must do the following:
+   * - If the order of this column is TYPE_ORDER, then no column index
+   *   must be written for this column chunk. While this is unfortunate for
+   *   performance, it is necessary to avoid conflict with legacy files that
+   *   still included NaN in min_values and max_values even if the page had
+   *   non-NaN values. To mitigate this, IEEE754_TOTAL_ORDER is recommended.
+   * - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]
+   *   and max_values[i] of that page must be set to a standard NaN value.

Review Comment:
   This is inconsistent with the definition above:
   ```
      * - Min and max statistics must contain the smallest and largest non-NaN
      *   values respectively, or if all non-null values are NaN, the smallest 
and
      *   largest NaN values as defined by IEEE 754 total order.
   ```
   Before starting on a PoC I'd like to clear up whether we're going to use 
total ordering for NaNs or use any NaN in the all-NaN case. (This already came 
up in the java PoC 
https://github.com/apache/parquet-java/pull/3393#discussion_r2948507289)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to