gszadovszky commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1247575939
##########
src/main/thrift/parquet.thrift:
##########
@@ -966,6 +985,23 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+
+ /**
+ * A list of Boolean values to determine pages that contain only NaNs. Only
+ * present for columns of type FLOAT and DOUBLE. If true, all non-null
+ * values in a page are NaN. Writers are suggested to set the corresponding
+ * entries in min_values and max_values to NaN, so that all lists have the same
+ * length and contain valid values. If false, then either all values in the
+ * page are null or there is at least one non-null non-NaN value in the page.
+ * As readers are supposed to ignore all NaN values in bounds, legacy readers
+ * who do not consider nan_pages yet are still able to use the column index
+ * but are not able to skip only-NaN pages.
+ */
+ 6: optional list<bool> nan_pages
Review Comment:
@mapleFU, I did not have any specific implementation in mind. (TBH, I only
have experience with parquet-mr.) This is mentioned in the PR description.
Maybe there are no such implementations at all.
@JFinis, I agree that we should not care about systems that may already be
writing NaN values into column indexes. I also agree that writing NaN values to
min/max is risky for existing systems. So we need to write valid non-NaN values
to min/max for all-NaN pages. (And, of course, mark them with either `nan_pages`
or `value_counts`.)
The more we narrow the range, the higher the chance that the page will be
dropped during filtering, which is good because, per the spec, we should not
search for NaN values anyway. What do you think about `[-Inf, -Inf]`? The worst
case is that we read a page of all NaN values instead of dropping it. With
these bounds, we would not drop the page for `< x`-like predicates. (This
turned out to be a rephrasing and summary of your previous comments. :smile: )
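To make the `[-Inf, -Inf]` trade-off concrete, here is a minimal sketch of min/max page pruning. It is not parquet-mr's actual filter code: `page_may_match` and the operator strings are hypothetical names invented for illustration, and the predicate constant `1.5` is arbitrary.

```python
import math


def page_may_match(min_value, max_value, op, x):
    """Return True if a page with these bounds might hold a value v with `v op x`.

    Hypothetical helper: a reader may skip the page only when this returns False.
    """
    if op == "<":
        return min_value < x
    if op == "<=":
        return min_value <= x
    if op == ">":
        return max_value > x
    if op == ">=":
        return max_value >= x
    if op == "==":
        return min_value <= x <= max_value
    raise ValueError(f"unsupported operator: {op}")


# Bounds proposed above for a page whose non-null values are all NaN.
all_nan_bounds = (-math.inf, -math.inf)

# Evaluate each predicate shape against an arbitrary finite constant.
results = {
    op: page_may_match(*all_nan_bounds, op, 1.5)
    for op in ("<", "<=", ">", ">=", "==")
}

for op, keep in results.items():
    print(f"v {op} 1.5 -> {'read page' if keep else 'skip page'}")
```

Under these assumptions, the all-NaN page is skipped for `>`, `>=`, and `==` against a finite constant, and is still read for `<` and `<=`, which is the worst case described above.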