[GitHub] [parquet-format] JFinis commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

via GitHub Mon, 26 Jun 2023 10:35:29 -0700


JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242530201



##########
src/main/thrift/parquet.thrift:
##########
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   @mapleFU From just reading the spec, I don't think we should have a backward 
compatibility problem, as legacy readers are already compelled to ignore NaNs 
if they find them anywhere. Thus, a legacy reader would ignore the NaN it finds 
in the column index and just not filter that page.
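
A minimal sketch of that reader behavior, assuming a closed range predicate `[lo, hi]` that can never match NaN. The function `can_skip_page` and its `nan_page` argument are hypothetical names for illustration, not part of any Parquet reader API; `nan_page=None` stands for a legacy reader that does not know about `nan_pages`.

```python
import math
from typing import Optional


def can_skip_page(min_v: float, max_v: float, lo: float, hi: float,
                  nan_page: Optional[bool] = None) -> bool:
    """Return True if the page cannot contain a value in [lo, hi]."""
    # Reader that understands nan_pages: every non-null value is NaN,
    # so a range predicate over ordinary numbers cannot match.
    if nan_page:
        return True
    # Legacy behavior (assumption drawn from the comment above): a NaN
    # bound is ignored, which means the page cannot be filtered.
    if math.isnan(min_v) or math.isnan(max_v):
        return False
    # Ordinary min/max pruning.
    return max_v < lo or min_v > hi
```

With NaN bounds and no `nan_page` information the page is kept, which is safe but not optimal; only a reader that sees `nan_page=True` can skip it.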
   
   Also note that regardless of whether we do (1), (2), or (3), 
[we basically **have to** write NaN into min and max](https://github.com/apache/parquet-format/pull/196#issuecomment-1491890773).
We have to write a valid value, and if a page contains only NaNs, every value 
except NaN would simply be wrong. The approaches differ only in what we write 
**in addition**, so to a legacy reader that doesn't read any of the new fields, 
the three approaches are equivalent.
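
The writer side could be sketched as follows, under the semantics proposed in the diff above. `page_index_entry` is a hypothetical helper, and returning `None` for the bounds of an all-null page is only a placeholder assumption for however the writer already encodes such pages.

```python
import math
from typing import List, Optional, Tuple

Entry = Tuple[Optional[float], Optional[float], bool]


def page_index_entry(values: List[Optional[float]]) -> Entry:
    """Return (min, max, nan_page) for one page of FLOAT/DOUBLE values."""
    non_null = [v for v in values if v is not None]
    non_nan = [v for v in non_null if not math.isnan(v)]
    if non_null and not non_nan:
        # Only NaNs: NaN is the only bound that is not simply wrong.
        return (math.nan, math.nan, True)
    if not non_nan:
        # All values are null: nan_page is false (placeholder bounds).
        return (None, None, False)
    # At least one ordinary value: NaNs are excluded from the bounds.
    return (min(non_nan), max(non_nan), False)
```

Whatever extra field is chosen, the first two elements of the tuple are the same, which is why a legacy reader sees no difference between the approaches.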



