JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1242469224


##########
src/main/thrift/parquet.thrift:
##########
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the 
same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   Thank you all for your sentiments. It looks like we have two votes for (1) 
and one for (3). Given that (1) would mean even less fields (and therefore 
faster decoding/encoding) I guess it would also solve the possible problem of a 
performance degradation due to this.
   
   Given that the majority is for (1), I would draft an update how this would 
look like. Basically:
   * Remove mentions of nan_pages
   * Add in comments that in the page index, all nan pages can be checked by 
having nan_count > 0 && min is NaN && max is NaN
   * Add comments about boundary order, as mentioned by @gszadovszky 
   
   I'll provide an update in the next days.
   
   @mapleFU would this be okay with you? You mentioned you would also be okay 
with the others.
   @pitrou Would (1) be okay for you as well?
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to