etseidl commented on code in PR #514:
URL: https://github.com/apache/parquet-format/pull/514#discussion_r2968952621
##########
src/main/thrift/parquet.thrift:
##########
@@ -1199,6 +1286,17 @@ struct ColumnIndex {
* Such more compact values must still be valid values within the column's
* logical type. Readers must make sure that list entries are populated
before
* using them by inspecting null_pages.
+ *
+ * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
+ * NaN values are not to be included in these bounds. If all non-null values
+ * of a page are NaN, then a writer must do the following:
+ * - If the order of this column is TYPE_ORDER, then no column index
+ * must be written for this column chunk. While this is unfortunate for
+ * performance, it is necessary to avoid conflict with legacy files that
+ * still included NaN in min_values and max_values even if the page had
+ * non-NaN values. To mitigate this, IEEE754_TOTAL_ORDER is recommended.
+ * - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]
+ * and max_values[i] of that page must be set to a standard NaN value.
Review Comment:
Yes. So here in the column index, if a page has all NaNs, pick a "standard
NaN" and populate min and max with it. But up above for column stats (and
presumably page header stats), we're to write the proper min and max NaN
actually seen, ordered using IEEE total order. I think we should either do the
same both places, or as you suggest elsewhere, continue not writing _anything_
to the column stats since we can now examine the null and nan counts and see if
they sum up to num_values.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]