yyanyy commented on pull request #2069:
URL: https://github.com/apache/iceberg/pull/2069#issuecomment-763080534


   > For ORC metrics, which currently don't populate `nanValueCounts` in the 
metrics and for which it seems difficult to populate `nanValueCounts`, will 
this still work when we create `Metrics` using the non-deprecated constructors 
that require passing `nanValueCounts`? This is in reference to this issue for 
removing the `Metrics` constructors that don't take in `nanValueCounts` maps: 
#2044
   > 
   > I suppose when we go to deprecate the old `Metrics` constructors, we can 
simply pass `null` into the `Metrics` constructor for `nanValueCounts` with ORC 
in order to keep the current behavior. Or update the various evaluator code 
paths to ensure that a missing key from this map does not indicate anything 
special if we choose to pass in an empty map.
   > 
   > This is all based off of my understanding of ORC's `ColumnStatistics` and 
the notes in this PR about `upperBound ` and `lowerBound` being `NaN` in ORC if 
_any_ value in the file is `NaN`, so please correct me if I'm mistaken and 
somebody sees an easy way to get NaN stats from an ORC file. To me, I fail to 
see how we can get the count of NaN values in the following code block (though 
as I mentioned we can always continue to pass in `null` to match the existing 
behavior, provided the various metrics evaluators are aware that `null` for 
that Map entry for nullValueCounts doesnt convey much meaning for ORC).
   > 
   > We could also potentially track NaN values in the various orc writers, 
like the parquet writers do.
   > 
   > The relevant code for populating ORC metrics is here:
   > 
   > 
https://github.com/apache/iceberg/blob/29915e3522f9a808420a2e16fb1c2cf3ae165a74/orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java#L100-L172
   
   Thanks for your interests in this PR! You are right that some of this PR 
will not be working without NaN counter for checking if a field contains 
partial/all NaN, and it's hard to get ORC `nanValueCounts` from ORC file 
directly. I think today evaluators should treat missing keys from this map or 
lack of such map the same way as not having NaN value count information for a 
given field, or do you think we are not doing this in some places?
   
   Regarding populating NaN counter for ORC files, yes we are going to track 
NaN values in various ORC writers just as we do for parquet writers, and the 
change is currently in review in #1790. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to