JFinis commented on code in PR #196:
URL: https://github.com/apache/parquet-format/pull/196#discussion_r1245289770
##########
src/main/thrift/parquet.thrift:
##########
@@ -966,6 +985,23 @@ struct ColumnIndex {
/** A list containing the number of null values for each page **/
5: optional list<i64> null_counts
+
+ /**
+ * A list of Boolean values to determine pages that contain only NaNs. Only
+ * present for columns of type FLOAT and DOUBLE. If true, all non-null
+ * values in a page are NaN. Writers are suggested to set the corresponding
+ * entries in min_values and max_values to NaN, so that all lists have the
+ * same length and contain valid values. If false, then either all values in the
+ * page are null or there is at least one non-null non-NaN value in the page.
+ * As readers are supposed to ignore all NaN values in bounds, legacy readers
+ * who do not consider nan_pages yet are still able to use the column index
+ * but are not able to skip only-NaN pages.
+ */
+ 6: optional list<bool> nan_pages
Review Comment:
@gszadovszky Any reader/writer pair that writes NaNs into column
indexes (and other places like page headers) and expects them to be there (and
otherwise yields wrong results while reading) *is not, and never was,* spec
conforming. In older releases of the spec, where NaN wasn't mentioned yet, such
a writer at least wasn't violating the spec directly, but even then NaN
handling was essentially "undefined behavior", as the spec never said how to
treat NaNs. Thus, relying on *one specific* behavior w.r.t. NaNs was a
non-portable assumption even back then.
Even today, a reader relying on one specific NaN semantics would already
yield erroneous results when reading spec-conforming Parquet files. E.g., if
it searches for NaNs and expects them to appear in min/max, it might filter
out pages that contain NaNs but don't have NaNs in their min/max.
Consequently, such a reader is already broken; yes, writing [-Inf,+Inf] into
the column index would break such a reader further, but all bets are off here
anyway. It is currently just not possible to handle NaNs correctly in a
portable way (which is what this PR is all about in the first place).
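To make that concrete, here is a minimal sketch (illustrative Java, not any
real parquet-mr API) of the broken pruning logic described above:
```java
/**
 * Sketch of a reader that prunes pages for an "x IS NaN" predicate by
 * checking whether NaN appears in the page's min/max bounds. This is
 * broken on spec-conforming files: a writer may legally keep NaN out
 * of the bounds, so a page can contain NaNs even though neither bound
 * is NaN.
 */
static boolean mightContainNaN(double min, double max) {
  // WRONG: assumes writers put NaN into the bounds, which the spec
  // never guaranteed. A page with values [1.0, NaN, 2.0] may be
  // written with min=1.0 and max=2.0 and would be skipped here,
  // silently dropping its NaN rows from the result.
  return Double.isNaN(min) || Double.isNaN(max);
}
```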
So TBH, backward compatibility with such a broken (or at least non-portable)
reader/writer pair seems like an absolute non-goal to me.
@pitrou A legacy reader that doesn't handle the new NaN semantics doesn't
need to distinguish here. All it needs to know is whether it may skip the
page or not. A page with bounds [-Inf,+Inf] can never be skipped, so
regardless of whether those bounds stem from NaNs or from real infinities, a
legacy reader yields correct results. A new reader that implements this PR
can make the distinction via nan_pages or the value_counts computation, as in
the sketch below.
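A minimal sketch of that skip decision for a range predicate `x > v`
(hypothetical helper names, assuming DOUBLE columns):
```java
/**
 * Legacy reader: decide whether a page can be skipped for "x > v",
 * given only the column-index bounds. With max = +Inf this never
 * skips, so [-Inf,+Inf] pages are always scanned -- conservative but
 * correct, whatever the reason for those bounds.
 */
static boolean canSkipPage(double min, double max, double v) {
  // Skip only if no value in [min, max] can satisfy x > v. NaN values
  // in the page are excluded from the bounds, but since NaN > v is
  // false anyway, skipping stays correct.
  return max <= v;
}

/** New reader implementing this PR: additionally consult nan_pages. */
static boolean canSkipPageWithNanPages(
    double min, double max, boolean nanPage, double v) {
  // An only-NaN page has no value satisfying x > v (all comparisons
  // with NaN are false), so it can always be skipped.
  return nanPage || canSkipPage(min, max, v);
}
```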
Note that actually *any* bounds are, mathematically speaking, correct for a
page containing only NaNs (and will yield correct results on spec-conforming
readers!). Min/max values in the column index don't need to be tight,
according to the spec. So the only condition that must hold is that no value
lies outside of the bounds (NaNs excluded). As an only-NaN page has no
non-NaN values, any bounds satisfy this condition, since there are no values
that need to lie inside them. So instead of [-Inf,+Inf] we could also choose
[0,0] or [42,1337]; both would yield correct results on spec-conforming
readers. Actually, the tighter the bounds, the more queries can skip the page.
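A writer-side sketch of the semantics proposed in the diff above (the
`PageIndexSink` interface is hypothetical, not parquet-mr API):
```java
/** Hypothetical sink collecting one column-index entry per page. */
interface PageIndexSink {
  void addPage(double min, double max, boolean nanPage);
}

/**
 * Compute the column-index entry for one page. NaNs never enter the
 * bounds; for an only-NaN page any bounds would be spec-valid ([0,0],
 * [42,1337], [-Inf,+Inf], ...), and per the doc comment above we emit
 * NaN/NaN so min_values/max_values keep the same length as the other
 * lists.
 */
static void addPageEntry(double[] nonNullValues, PageIndexSink sink) {
  double min = Double.POSITIVE_INFINITY;
  double max = Double.NEGATIVE_INFINITY;
  boolean sawNonNaN = false;
  for (double v : nonNullValues) {
    if (Double.isNaN(v)) continue; // NaNs are ignored for bounds
    sawNonNaN = true;
    min = Math.min(min, v);
    max = Math.max(max, v);
  }
  if (sawNonNaN) {
    sink.addPage(min, max, /*nanPage=*/ false);
  } else if (nonNullValues.length > 0) {
    // Only-NaN page: no non-NaN value constrains the bounds.
    sink.addPage(Double.NaN, Double.NaN, /*nanPage=*/ true);
  }
  // All-null pages (nonNullValues.length == 0) have nan_pages = false
  // and are tracked via null_pages; omitted here for brevity.
}
```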