wgtmac commented on code in PR #514:
URL: https://github.com/apache/parquet-format/pull/514#discussion_r2973696340
##########
src/main/thrift/parquet.thrift:
##########
@@ -1199,6 +1286,17 @@ struct ColumnIndex {
* Such more compact values must still be valid values within the column's
* logical type. Readers must make sure that list entries are populated
before
* using them by inspecting null_pages.
+ *
+ * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
+ * NaN values are not to be included in these bounds. If all non-null values
+ * of a page are NaN, then a writer must do the following:
+ * - If the order of this column is TYPE_ORDER, then no column index
+ * must be written for this column chunk. While this is unfortunate for
+ * performance, it is necessary to avoid conflict with legacy files that
+ * still included NaN in min_values and max_values even if the page had
+ * non-NaN values. To mitigate this, IEEE754_TOTAL_ORDER is recommended.
+ * - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]
+ * and max_values[i] of that page must be set to a standard NaN value.
Review Comment:
I've reverted the edit to `write standard NaN value when all non-null values
are NaN`. Now the behavior is consistent for both stats and page index that
smallest and largest NaN values are written in the page index and column stats
if all non-null values are NaN. I think this make more sense to use the ieee754
total order and less confusing to readers. @etseidl @emkornfield
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]