[GitHub] [parquet-format] JFinis commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

via GitHub Tue, 27 Jun 2023 03:07:57 -0700


JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1243489728



##########
src/main/thrift/parquet.thrift:
##########
@@ -966,6 +985,23 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
+
+  /**
+   * A list of Boolean values to determine pages that contain only NaNs. Only
+   * present for columns of type FLOAT and DOUBLE. If true, all non-null
+   * values in a page are NaN. Writers are suggested to set the corresponding
+   * entries in min_values and max_values to NaN, so that all lists have the 
same
+   * length and contain valid values. If false, then either all values in the
+   * page are null or there is at least one non-null non-NaN value in the page.
+   * As readers are supposed to ignore all NaN values in bounds, legacy readers
+   * who do not consider nan_pages yet are still able to use the column index
+   * but are not able to skip only-NaN pages.
+   */
+  6: optional list<bool> nan_pages

Review Comment:
   
   What is a good path forward then? I see the following options:
   
   a) Ship this change but exclude the handling of only-Nan pages in the column 
index and only handle the other cases. Then we could still at least specify how 
to handle NaNs in the column index in cases where no "only NaN" page exists and 
these cases would then at least be well defined (only NaN pages are probably an 
edge case, so this would already allow us to filter in 99% of all cases and 
therefore get us almost to the goal).
   b) Add ColumnOrder to this proposal. (again happy to do that)
   It would be a good case to start using the ColumnOrder enum. This would also 
give us the opportunity to define `boundary_order` explicitly for this column 
order, so we could even assume an ordering.
   c) Drop this altogether and live with the fact that float / double columns 
are basically unfilterable in many cases.
   
   @gszadovszky 
   Side note: I think that the current read behavior in parquet-mr as you state 
it is not adhering to the spec and is dangerous at best. I have seen Parquet 
files which have NaN in these bounds in the wild (I don't know who wrote them) 
and since the mandate to not write NaNs to these bounds is in the spec only for 
a while ([introduced 
here](https://github.com/apache/parquet-format/commit/92ae9a3187d7673c9a40f81f40886faa20807722)),
 older writers would have been perfectly spec-conforming when writing NaN into 
these bounds, so files having NaNs here are adhering to (an older version of 
the) spec and therefore the parquet-mr read code should be robust to handle 
these cases.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [parquet-format] JFinis commented on a diff in pull request #196: PARQUET-2249: Add nan_count to handle NaNs in statistics

Reply via email to