clee704 opened a new issue, #50182:
URL: https://github.com/apache/arrow/issues/50182

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When every value in a `FLOAT`, `DOUBLE`, or `FLOAT16` (HALF_FLOAT) column is 
the **same infinity**, the Parquet writer stores a **finite** min/max statistic 
(the largest representable finite value) instead of the infinity that is 
actually in the column.
   
   - An all-`+Inf` column gets **`min` = `FLT_MAX`/`DBL_MAX`** (should be 
`+Inf`). `max` is correctly `+Inf`.
   - An all-`-Inf` column gets **`max` = `-FLT_MAX`/`-DBL_MAX`** (should be 
`-Inf`). `min` is correctly `-Inf`.
   
   #### Reproduction (PyArrow)
   
   ```python
   import io
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   inf = float("inf")
   for name, vals in [("all +Inf", [inf, inf, inf]), ("all -Inf", [-inf, -inf, 
-inf])]:
       buf = io.BytesIO()
       pq.write_table(pa.table({"c": pa.array(vals, type=pa.float64())}), buf)
       st = pq.ParquetFile(buf).metadata.row_group(0).column(0).statistics
       print(f"{name}: min={st.min!r}  max={st.max!r}")
   ```
   
   Output (PyArrow 24.0.0):
   
   ```
   all +Inf: min=1.7976931348623157e+308  max=inf
   all -Inf: min=-inf  max=-1.7976931348623157e+308
   ```
   
   Expected: `all +Inf -> min=inf, max=inf` and `all -Inf -> min=-inf, 
max=-inf`.
   
   `FLOAT` and `FLOAT16` behave the same way (an all-`+Inf` half-float column 
reports `min = 65504.0`, the largest finite half).
   
   I also confirmed this at the C++ level (no PyArrow) against `libparquet` 
24.0.0, both through the writer's comparator directly and end-to-end via a real 
Parquet file:
   
   ```
   TypedComparator::GetMinMax:
     double  all +Inf  -> min=1.7976931348623157e+308   max=inf
     double  all -Inf  -> min=-inf                      
max=-1.7976931348623157e+308
   on-disk (write file, read statistics back):
     double  all +Inf  -> min=DBL_MAX  (WRONG)          max=inf
     double  all -Inf  -> min=-inf                      max=-DBL_MAX  (WRONG)
     double  finite    -> min=1  max=3                  (control: correct)
     double  +/-Inf    -> min=-inf  max=inf             (control: correct)
   ```
   
   #### Root cause
   
   In `cpp/src/parquet/statistics.cc`, the floating-point comparator seeds the 
min/max accumulation with the largest/smallest **finite** values:
   
   ```cpp
   constexpr static T DefaultMin() { return std::numeric_limits<T>::max(); }    
// FLT_MAX / DBL_MAX
   constexpr static T DefaultMax() { return std::numeric_limits<T>::lowest(); } 
// -FLT_MAX / -DBL_MAX
   ```
   
   and folds each value into them in `TypedComparatorImpl::GetMinMax` / 
`GetMinMaxSpaced`:
   
   ```cpp
   T min = Helper::DefaultMin();
   T max = Helper::DefaultMax();
   for (...) {
     min = Helper::Min(type_length_, min, Helper::Coalesce(val, 
Helper::DefaultMin()));
     max = Helper::Max(type_length_, max, Helper::Coalesce(val, 
Helper::DefaultMax()));
   }
   ```
   
   These finite sentinels are **not identity elements** for min/max once `±Inf` 
is present in the data. `Min(FLT_MAX, +Inf)` returns `FLT_MAX` (because 
`FLT_MAX < +Inf`), so the seed is never displaced when every value is `+Inf`; 
symmetrically, `Max(-FLT_MAX, -Inf)` returns `-FLT_MAX`. `CleanStatistic()` 
only special-cases NaN and signed zero, so it does not correct this.
   
   `Float16CompareHelper` has the same flaw (`DefaultMin()` = `Float16::max()`, 
`DefaultMax()` = `Float16::lowest()`).
   
   The compute `min_max` aggregate kernel is **not** affected 
(`pa.compute.min_max(pa.array([inf, inf, inf]))` correctly returns `{min: inf, 
max: inf}`); this is specific to the Parquet writer's comparator.
   
   Crucially, the resulting (wrong) value is then written as **exact**: 
`TypedStatisticsImpl::SetMinMaxPair` unconditionally sets
   
   ```cpp
   statistics_.is_min_value_exact = true;
   statistics_.is_max_value_exact = true;
   ```
   
   so the file records `min = DBL_MAX` *and* asserts it is the exact minimum.
   
   #### Impact
   
   The truncation always moves the bound in the conservative direction (min 
lowered toward finite, max raised toward finite), so range-based row-group 
filtering / predicate pushdown remains **safe** — no matching rows are skipped. 
The damage is to consumers that use the statistic as a *value* rather than a 
range bound: because the stored min/max is wrong yet flagged 
`is_min_value_exact = true`, a spec-compliant reader is told the exact minimum 
of an all-`+Inf` column is `DBL_MAX` (a value not even present in the column). 
Concretely this gives **wrong query results** for engines that answer 
`MIN(col)` / `MAX(col)` directly from Parquet statistics (e.g. Spark's Parquet 
aggregate push-down, `spark.sql.parquet.aggregatePushdown`), and a wrong value 
for any metadata/introspection tooling.
   
   #### Suggested fix
   
   Seed the floating-point accumulation with the true identities — `+infinity` 
for the running min and `-infinity` for the running max (and the half-float 
equivalents) — rather than the largest finite values; or track whether any 
non-NaN value has been observed. Note this also requires updating the "no valid 
value" detection in `CleanStatistic()` (currently `min == 
numeric_limits<T>::max() && max == numeric_limits<T>::lowest()` → return 
`nullopt`) to match the new sentinels. A genuine all-`+Inf` column has `min == 
+Inf, max == +Inf`, which stays distinguishable from the empty/all-NaN case 
(`min == +Inf, max == -Inf`).
   
   Confirmed present on `main` (the relevant code is unchanged from 24.0.0).
   
   ### Component(s)
   
   C++, Parquet
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to