WZhuo opened a new pull request, #727: URL: https://github.com/apache/iceberg-cpp/pull/727
## What

Collects NaN value counts for float and double columns during Parquet writes, since the Parquet footer statistics do not track NaN counts.

## Changes

- **Write-side NaN metric collection** (`FieldMetricsCollector`): A visitor that walks each record batch before writing, accumulating value counts, null counts, NaN counts, and NaN-excluding lower/upper bounds for float/double fields.
- **MetricsConfig-aware skipping**: Fields whose `MetricsMode` is `kNone` are skipped entirely, avoiding wasted work.
- **Integration with existing footer metrics**: Write-side `FieldMetrics` take precedence over footer statistics in `ParquetMetrics::GetMetrics`, so NaN counts are populated while counts/bounds still fall back to footer stats when write-side data isn't available.
- **Tests**: `ParquetMetricsTest` now overrides `ReportsNanCounts()` to `true`, and existing NaN test cases verify NaN counts alongside existing value/null count assertions.

## Behavior alignment with Java

- Fields nested inside lists/maps do not get NaN metrics (Java and C++ agree on the result: Java collects then discards, while C++ skips collection entirely).
- NaN values are excluded from lower/upper bounds in both implementations.
- Float/double fields with all-NaN values correctly set `nan_value_count` without setting bounds.
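To make the described behavior concrete, here is a minimal sketch of NaN-aware accumulation for a single nullable double column. It is not the code from this PR: `MetricsMode` is reduced to two values, and `FieldMetrics`, `CollectDoubleMetrics`, and `MergeMetrics` are hypothetical stand-ins that only mirror the names used above. Record batches, the visitor, and nested list/map fields are not modeled.

```cpp
#include <cmath>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-ins; not the iceberg-cpp definitions.
enum class MetricsMode { kNone, kFull };

struct FieldMetrics {
  int64_t value_count = 0;            // includes nulls and NaNs
  int64_t null_count = 0;
  std::optional<int64_t> nan_count;   // unset when the source cannot report it
  std::optional<double> lower_bound;  // unset when no non-NaN value was seen
  std::optional<double> upper_bound;
};

// Accumulates metrics for one nullable double column.
// Returns std::nullopt for kNone so that no per-value work is done.
std::optional<FieldMetrics> CollectDoubleMetrics(
    const std::vector<std::optional<double>>& column, MetricsMode mode) {
  if (mode == MetricsMode::kNone) return std::nullopt;

  FieldMetrics m;
  int64_t nans = 0;
  for (const auto& v : column) {
    ++m.value_count;
    if (!v.has_value()) {
      ++m.null_count;
      continue;
    }
    if (std::isnan(*v)) {
      ++nans;  // counted, but never allowed to influence the bounds
      continue;
    }
    if (!m.lower_bound || *v < *m.lower_bound) m.lower_bound = *v;
    if (!m.upper_bound || *v > *m.upper_bound) m.upper_bound = *v;
  }
  m.nan_count = nans;
  return m;
}

// Write-side metrics win when present; otherwise fall back to the
// footer-derived metrics, which carry no NaN count.
FieldMetrics MergeMetrics(const std::optional<FieldMetrics>& write_side,
                          const FieldMetrics& footer) {
  return write_side.value_or(footer);
}
```

Under these assumptions, a column holding only NaN values ends up with `nan_count` set and both bounds unset, and a field collected with `kNone` simply falls through to whatever the footer reported.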
