JFinis commented on PR #196:
URL: https://github.com/apache/parquet-format/pull/196#issuecomment-1481370812

   @zhongyujiang (replying here, as I can't answer your comment directly): this 
is the problem with your suggestion of checking `nanCount == valueCount` to 
detect pages that contain only NaNs:
   
   > @mapleFU To your general comment (I can't answer there)
   > 
   > > The skeleton LGTM. But I wonder why if it has min/max/nan_count, it can 
decide nan by min-max. Can we just decide it by `null_count + nan_count == 
num_values`?
   > 
   > The problem is that the ColumnIndex does not have a `num_values` field, 
so using this computation to derive whether there are only NaNs would only be 
applicable to Statistics, not to the ColumnIndex. Of course, we could do what 
I suggested in the alternatives and give the ColumnIndex a `num_values` list. 
Then this would indeed work everywhere, but at the cost of an additional list.
   > 
   > So I see we have the following options:
   > 
   > * Do what I did here, i.e., use min/max to determine whether there are 
only NaNs
   > * Add a `num_values` list to the ColumnIndex
   > * Accept the fact that the column index cannot detect only-NaN pages 
(might lead to fishy semantics)
   > * Tell readers to use the `min==max==NaN` reasoning only in the column 
index, and to use `null_count + nan_count == num_values` for the statistics.
   > 
   > Which one would you suggest here?
   
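To make the difference between the two checks concrete, here is a minimal 
Python sketch. `PageStats`, `only_nans_by_counts` and `only_nans_by_min_max` 
are hypothetical names for illustration only, not the Thrift-generated 
structs or any existing reader API. `num_values` is optional to model the 
fact that the ColumnIndex does not carry it today.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageStats:
    """Hypothetical stand-in for per-page statistics (illustration only)."""
    min_value: Optional[float]
    max_value: Optional[float]
    null_count: int
    nan_count: int
    # Assumption: None models the ColumnIndex case, where no num_values exists.
    num_values: Optional[int] = None


def only_nans_by_counts(stats: PageStats) -> bool:
    """Count-based check: null_count + nan_count == num_values.

    Note that a page consisting only of nulls also satisfies this equation.
    """
    if stats.num_values is None:
        raise ValueError("num_values is not available, cannot use the count check")
    return stats.null_count + stats.nan_count == stats.num_values


def only_nans_by_min_max(stats: PageStats) -> bool:
    """Min/max-based check: min == max == NaN signals an only-NaN page."""
    return (
        stats.min_value is not None
        and stats.max_value is not None
        and math.isnan(stats.min_value)
        and math.isnan(stats.max_value)
    )


# Statistics-like case: num_values is known, so the count check works.
chunk = PageStats(min_value=math.nan, max_value=math.nan,
                  null_count=2, nan_count=8, num_values=10)
count_result = only_nans_by_counts(chunk)

# ColumnIndex-like case: no num_values, so only the min/max check is usable.
page = PageStats(min_value=math.nan, max_value=math.nan,
                 null_count=2, nan_count=8)
min_max_result = only_nans_by_min_max(page)
```

The count check has to give up on the ColumnIndex-like entry, which is 
exactly why the options above either add a `num_values` list or fall back to 
the min/max reasoning there.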
   


