clee704 opened a new pull request, #50183: URL: https://github.com/apache/arrow/pull/50183
### Rationale for this change When every value in a `FLOAT`, `DOUBLE`, or `FLOAT16` column is the same infinity, the Parquet writer stored a *finite* min/max statistic instead of the infinity actually present in the column: - all `+Inf` → `min` written as `FLT_MAX` / `DBL_MAX` (should be `+Inf`) - all `-Inf` → `max` written as `-FLT_MAX` / `-DBL_MAX` (should be `-Inf`) The running min/max were seeded with `std::numeric_limits<T>::max()` / `::lowest()` — the largest/smallest *finite* values. These are not identity elements for min/max once the data contains infinities: `Min(FLT_MAX, +Inf)` keeps `FLT_MAX`, so an all-`+Inf` column never displaces the seed (and symmetrically for `-Inf`). The wrong value is then written with `is_min_value_exact = true`, so a reader is told the exact minimum of an all-`+Inf` column is `FLT_MAX` — a value not even present in the column. Range-based row-group filtering stays correct (the bound is only loosened in the safe direction), but engines that read min/max statistics as values (statistics-based `MIN`/`MAX` evaluation, metadata introspection) return wrong results. See GH-50182 for a full reproduction in both PyArrow and C++. ### What changes are included in this PR? - `CompareHelper::DefaultMin()` / `DefaultMax()` now seed floating-point accumulation with `+Inf` / `-Inf` (the true identities) instead of the largest finite values; integral types are unchanged. - The same change is applied to the `Float16` comparator, via new `Float16Constants::positive_infinity()` / `negative_infinity()`. - The "no valid value was seen" detection in `CleanStatistic` / `CleanFloat16Statistic` (which relied on the old finite seeds) is updated to match the new `±Inf` seeds. An all-NaN or empty column still leaves `min = +Inf, max = -Inf` (`min > max`, which real data can never produce), so it stays unambiguously detectable and continues to emit no statistics. ### Are these changes tested? Yes. A new typed test `TestFloatStatistics.Infinities` (covering `FLOAT`, `DOUBLE`, and `FLOAT16`) asserts that all-`+Inf`, all-`-Inf`, and mixed `±Inf` columns produce the exact infinity as min/max (the existing `AssertMinMaxAre` helper also checks `is_min_value_exact`/`is_max_value_exact`). The existing all-NaN tests continue to pass, exercising the updated empty/all-NaN detection. ### Are there any user-facing changes? Parquet files written for all-infinity floating-point columns now record the correct `±Inf` min/max statistics instead of the largest finite value. There are no public API changes. **This PR contains a "Critical Fix".** It fixes a bug that produced incorrect data: the writer emitted min/max statistics that are wrong yet flagged exact (`is_min_value_exact = true`), which can yield incorrect results for readers that evaluate `MIN`/`MAX` directly from Parquet statistics. --- *Disclosure: this change was prepared with the assistance of an AI coding tool (GitHub Copilot CLI) — root-cause analysis, the fix, and the test. It was verified locally by building `parquet-internals-test` and running the `TestFloatStatistics` suite, and I have reviewed the diff.* -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
