zhongyujiang commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1481237476

   > Thus, to solve the problem of only-NaN pages, the comments in the spec
   > are extended to mandate the following behavior:
   > 
   > Once a writer writes the nan_count/nan_counts fields, they have to:
   > - never write NaN into min/max if there are non-NaN non-Null values and
   > - always write min=max=NaN if the only non-null values in a page are NaN
   > 
   > A reader observing that nan_count/nan_counts field was written can then
   > rely on that if min or max are NaN, then both have to be NaN and this
   > means that the only non-NULL values are NaN.
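
   For reference, here is a rough sketch of how I read the quoted writer rule. It is only an illustration under my own assumptions: `NaNAwareMinMax`, `Stats`, and `compute` are hypothetical names, not parquet-mr or parquet-format APIs, and the input is assumed to hold only the non-null values of one page, with at least one value present.

```java
public final class NaNAwareMinMax {
    private NaNAwareMinMax() {}

    /** Illustrative result holder; not the Parquet Thrift Statistics struct. */
    public static final class Stats {
        public final double min;
        public final double max;
        public final long nanCount;

        Stats(double min, double max, long nanCount) {
            this.min = min;
            this.max = max;
            this.nanCount = nanCount;
        }
    }

    /**
     * Computes min/max/nanCount over the non-null values of a page.
     * NaN never enters min/max unless every value is NaN, in which
     * case min == max == NaN, as the quoted rule requires.
     */
    public static Stats compute(double[] nonNullValues) {
        if (nonNullValues.length == 0) {
            // Assumption: all-null or empty pages are handled elsewhere.
            throw new IllegalArgumentException("expected at least one non-null value");
        }
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        long nanCount = 0;
        boolean sawNonNaN = false;
        for (double v : nonNullValues) {
            if (Double.isNaN(v)) {
                nanCount++; // counted, but kept out of min/max
                continue;
            }
            sawNonNaN = true;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        if (!sawNonNaN) {
            // Only NaN values were seen: signal this with min = max = NaN.
            return new Stats(Double.NaN, Double.NaN, nanCount);
        }
        return new Stats(min, max, nanCount);
    }
}
```

   With this reading, a reader that sees a NaN min or max together with a written nan_count would treat the page as containing only NaN among its non-null values.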
   
   Instead of writing min and max as NaN when there are only NaN values, and 
then having the reader check whether a NaN min/max is credible by evaluating 
whether naNCounts is empty, wouldn't it be much simpler if we just left the 
evaluation of isNaN and notNaN to the reader?
   A reader can always conclude that a page / column is all NaN when the value 
count of the field == the NaN count of the field (when valueCounts and 
naNCounts both exist). This is Iceberg's current way of [evaluating 
isNaN](https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/api/src/main/java/org/apache/iceberg/expressions/StrictMetricsEvaluator.java#L486).
   Is there anything wrong with doing this in Parquet?
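
   A minimal sketch of the count-based check I have in mind, written from scratch rather than copied from Iceberg. `NaNCountCheck` and its method names are hypothetical, and `null` stands for a statistic the writer did not record. It assumes the value count and the NaN count are measured over the same set of values; if the value count also includes nulls, the null count would have to be subtracted first.

```java
public final class NaNCountCheck {
    private NaNCountCheck() {}

    /**
     * Returns true only when the counts prove that every value is NaN.
     * Missing statistics (null) never prove anything, so they yield false.
     */
    public static boolean containsNaNsOnly(Long valueCount, Long nanCount) {
        if (valueCount == null || nanCount == null) {
            return false;
        }
        return valueCount.longValue() == nanCount.longValue();
    }

    /** Returns true only when the counts prove that no value is NaN. */
    public static boolean containsNoNaNs(Long nanCount) {
        return nanCount != null && nanCount.longValue() == 0L;
    }
}
```

   With this approach the reader never has to interpret a NaN in min/max; the two counts alone answer both isNaN and notNaN.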
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
