gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243234931
##########
src/main/thrift/parquet.thrift:
##########
@@ -966,6 +985,23 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+
+ /**
+ * A list of Boolean values to determine pages that contain only NaNs. Only
+ * present for columns of type FLOAT and DOUBLE. If true, all non-null
+ * values in a page are NaN. Writers are suggested to set the corresponding
+ * entries in min_values and max_values to NaN, so that all lists have the same
+ * length and contain valid values. If false, then either all values in the
+ * page are null or there is at least one non-null non-NaN value in the page.
+ * As readers are supposed to ignore all NaN values in bounds, legacy readers
+ * who do not consider nan_pages yet are still able to use the column index
+ * but are not able to skip only-NaN pages.
+ */
+ 6: optional list<bool> nan_pages
Review Comment:
@mapleFU, it seems to me that parquet-mr checks for NaN in column indexes only
on the write path. (In that case the column index is invalid and is not
written to the file.) On the read path, though, there is no such check. This
means that legacy readers can produce incorrect results from FLOAT/DOUBLE
column indexes once we start writing NaN values. (Sorry for the late
conclusion; I had thought this check was implemented in both directions.)
The only way I can think of to handle NaN in a backward-compatible way is to
define a
[ColumnOrder](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L863)
for FP values that includes NaNs as well. In that case we would also add
support for NaNs in row-group level statistics. parquet-mr currently skips all
kinds of min/max statistics for columns with unsupported column orders, so
legacy readers would ignore the new statistics rather than misinterpret them.
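
To make the column-order idea concrete, here is a minimal sketch of how gating
statistics on a recognized column order could protect legacy readers. It is
not parquet-mr code: `PageStats`, `usable_stats`, `may_skip_page_gt`, and the
order name `FP_ORDER_WITH_NAN` are hypothetical names chosen for illustration.
Only `TYPE_ORDER` exists in parquet.thrift today.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Orders each kind of reader understands. "FP_ORDER_WITH_NAN" is an invented
# placeholder for the new column order discussed above.
LEGACY_KNOWN_ORDERS = {"TYPE_ORDER"}
NEW_KNOWN_ORDERS = {"TYPE_ORDER", "FP_ORDER_WITH_NAN"}


@dataclass
class PageStats:
    """Hypothetical min/max bounds for one page."""
    min_value: float
    max_value: float


def usable_stats(stats: PageStats, column_order: str,
                 known_orders: Set[str]) -> Optional[PageStats]:
    """Return the stats only if the reader recognizes the column order."""
    return stats if column_order in known_orders else None


def may_skip_page_gt(stats: Optional[PageStats], threshold: float) -> bool:
    """Decide whether a page can be skipped for `value > threshold`."""
    if stats is None:
        return False  # no usable statistics: the page must be read
    return stats.max_value <= threshold


stats = PageStats(min_value=1.0, max_value=3.0)

# A legacy reader does not know the new order, so it drops the stats and
# reads the page instead of trusting bounds it cannot interpret.
legacy_view = usable_stats(stats, "FP_ORDER_WITH_NAN", LEGACY_KNOWN_ORDERS)
legacy_skips = may_skip_page_gt(legacy_view, 5.0)

# A reader that knows the new order can use the bounds to skip the page.
new_view = usable_stats(stats, "FP_ORDER_WITH_NAN", NEW_KNOWN_ORDERS)
new_skips = may_skip_page_gt(new_view, 5.0)
```

The trade-off in this sketch is that legacy readers lose page skipping for
such columns entirely, but they never act on NaN-bearing bounds.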