clee704 opened a new pull request, #50183:
URL: https://github.com/apache/arrow/pull/50183

   ### Rationale for this change
   
   When every value in a `FLOAT`, `DOUBLE`, or `FLOAT16` column is the same 
infinity, the Parquet writer stored a *finite* min/max statistic instead of the 
infinity actually present in the column:
   
   - all `+Inf` → `min` written as `FLT_MAX` / `DBL_MAX` (should be `+Inf`)
   - all `-Inf` → `max` written as `-FLT_MAX` / `-DBL_MAX` (should be `-Inf`)
   
   The running min/max were seeded with `std::numeric_limits<T>::max()` / 
`::lowest()` — the largest/smallest *finite* values. These are not identity 
elements for min/max once the data contains infinities: `Min(FLT_MAX, +Inf)` 
keeps `FLT_MAX`, so an all-`+Inf` column never displaces the seed (and 
symmetrically for `-Inf`). The wrong value is then written with 
`is_min_value_exact = true`, so a reader is told the exact minimum of an 
all-`+Inf` column is `FLT_MAX` — a value not even present in the column. 
Range-based row-group filtering stays correct (the bound is only loosened in 
the safe direction), but engines that read min/max statistics as values 
(statistics-based `MIN`/`MAX` evaluation, metadata introspection) return wrong 
results.
   
   See GH-50182 for a full reproduction in both PyArrow and C++.
   
   ### What changes are included in this PR?
   
   - `CompareHelper::DefaultMin()` / `DefaultMax()` now seed floating-point 
accumulation with `+Inf` / `-Inf` (the true identities) instead of the largest 
finite values; integral types are unchanged.
   - The same change is applied to the `Float16` comparator, via new 
`Float16Constants::positive_infinity()` / `negative_infinity()`.
   - The "no valid value was seen" detection in `CleanStatistic` / 
`CleanFloat16Statistic` (which relied on the old finite seeds) is updated to 
match the new `±Inf` seeds. An all-NaN or empty column still leaves `min = 
+Inf, max = -Inf` (`min > max`, which real data can never produce), so it stays 
unambiguously detectable and continues to emit no statistics.
   
   ### Are these changes tested?
   
   Yes. A new typed test `TestFloatStatistics.Infinities` (covering `FLOAT`, 
`DOUBLE`, and `FLOAT16`) asserts that all-`+Inf`, all-`-Inf`, and mixed `±Inf` 
columns produce the exact infinity as min/max (the existing `AssertMinMaxAre` 
helper also checks `is_min_value_exact`/`is_max_value_exact`). The existing 
all-NaN tests continue to pass, exercising the updated empty/all-NaN detection.
   
   ### Are there any user-facing changes?
   
   Parquet files written for all-infinity floating-point columns now record the 
correct `±Inf` min/max statistics instead of the largest finite value. There 
are no public API changes.
   
   **This PR contains a "Critical Fix".** It fixes a bug that produced 
incorrect data: the writer emitted min/max statistics that are wrong yet 
flagged exact (`is_min_value_exact = true`), which can yield incorrect results 
for readers that evaluate `MIN`/`MAX` directly from Parquet statistics.
   
   ---
   
   *Disclosure: this change was prepared with the assistance of an AI coding 
tool (GitHub Copilot CLI) — root-cause analysis, the fix, and the test. It was 
verified locally by building `parquet-internals-test` and running the 
`TestFloatStatistics` suite, and I have reviewed the diff.*
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to