clee704 opened a new issue, #50182:
URL: https://github.com/apache/arrow/issues/50182
### Describe the bug, including details regarding any error messages,
version, and platform.
When every value in a `FLOAT`, `DOUBLE`, or `FLOAT16` (HALF_FLOAT) column is
the **same infinity**, the Parquet writer stores a **finite** min/max statistic
(the largest representable finite value) instead of the infinity that is
actually in the column.
- An all-`+Inf` column gets **`min` = `FLT_MAX`/`DBL_MAX`** (should be
`+Inf`). `max` is correctly `+Inf`.
- An all-`-Inf` column gets **`max` = `-FLT_MAX`/`-DBL_MAX`** (should be
`-Inf`). `min` is correctly `-Inf`.
#### Reproduction (PyArrow)
```python
import io
import pyarrow as pa
import pyarrow.parquet as pq
inf = float("inf")
for name, vals in [("all +Inf", [inf, inf, inf]), ("all -Inf", [-inf, -inf,
-inf])]:
buf = io.BytesIO()
pq.write_table(pa.table({"c": pa.array(vals, type=pa.float64())}), buf)
st = pq.ParquetFile(buf).metadata.row_group(0).column(0).statistics
print(f"{name}: min={st.min!r} max={st.max!r}")
```
Output (PyArrow 24.0.0):
```
all +Inf: min=1.7976931348623157e+308 max=inf
all -Inf: min=-inf max=-1.7976931348623157e+308
```
Expected: `all +Inf -> min=inf, max=inf` and `all -Inf -> min=-inf,
max=-inf`.
`FLOAT` and `FLOAT16` behave the same way (an all-`+Inf` half-float column
reports `min = 65504.0`, the largest finite half).
I also confirmed this at the C++ level (no PyArrow) against `libparquet`
24.0.0, both through the writer's comparator directly and end-to-end via a real
Parquet file:
```
TypedComparator::GetMinMax:
double all +Inf -> min=1.7976931348623157e+308 max=inf
double all -Inf -> min=-inf
max=-1.7976931348623157e+308
on-disk (write file, read statistics back):
double all +Inf -> min=DBL_MAX (WRONG) max=inf
double all -Inf -> min=-inf max=-DBL_MAX (WRONG)
double finite -> min=1 max=3 (control: correct)
double +/-Inf -> min=-inf max=inf (control: correct)
```
#### Root cause
In `cpp/src/parquet/statistics.cc`, the floating-point comparator seeds the
min/max accumulation with the largest/smallest **finite** values:
```cpp
constexpr static T DefaultMin() { return std::numeric_limits<T>::max(); }
// FLT_MAX / DBL_MAX
constexpr static T DefaultMax() { return std::numeric_limits<T>::lowest(); }
// -FLT_MAX / -DBL_MAX
```
and folds each value into them in `TypedComparatorImpl::GetMinMax` /
`GetMinMaxSpaced`:
```cpp
T min = Helper::DefaultMin();
T max = Helper::DefaultMax();
for (...) {
min = Helper::Min(type_length_, min, Helper::Coalesce(val,
Helper::DefaultMin()));
max = Helper::Max(type_length_, max, Helper::Coalesce(val,
Helper::DefaultMax()));
}
```
These finite sentinels are **not identity elements** for min/max once `±Inf`
is present in the data. `Min(FLT_MAX, +Inf)` returns `FLT_MAX` (because
`FLT_MAX < +Inf`), so the seed is never displaced when every value is `+Inf`;
symmetrically, `Max(-FLT_MAX, -Inf)` returns `-FLT_MAX`. `CleanStatistic()`
only special-cases NaN and signed zero, so it does not correct this.
`Float16CompareHelper` has the same flaw (`DefaultMin()` = `Float16::max()`,
`DefaultMax()` = `Float16::lowest()`).
The compute `min_max` aggregate kernel is **not** affected
(`pa.compute.min_max(pa.array([inf, inf, inf]))` correctly returns `{min: inf,
max: inf}`); this is specific to the Parquet writer's comparator.
Crucially, the resulting (wrong) value is then written as **exact**:
`TypedStatisticsImpl::SetMinMaxPair` unconditionally sets
```cpp
statistics_.is_min_value_exact = true;
statistics_.is_max_value_exact = true;
```
so the file records `min = DBL_MAX` *and* asserts it is the exact minimum.
#### Impact
The truncation always moves the bound in the conservative direction (min
lowered toward finite, max raised toward finite), so range-based row-group
filtering / predicate pushdown remains **safe** — no matching rows are skipped.
The damage is to consumers that use the statistic as a *value* rather than a
range bound: because the stored min/max is wrong yet flagged
`is_min_value_exact = true`, a spec-compliant reader is told the exact minimum
of an all-`+Inf` column is `DBL_MAX` (a value not even present in the column).
Concretely this gives **wrong query results** for engines that answer
`MIN(col)` / `MAX(col)` directly from Parquet statistics (e.g. Spark's Parquet
aggregate push-down, `spark.sql.parquet.aggregatePushdown`), and a wrong value
for any metadata/introspection tooling.
#### Suggested fix
Seed the floating-point accumulation with the true identities — `+infinity`
for the running min and `-infinity` for the running max (and the half-float
equivalents) — rather than the largest finite values; or track whether any
non-NaN value has been observed. Note this also requires updating the "no valid
value" detection in `CleanStatistic()` (currently `min ==
numeric_limits<T>::max() && max == numeric_limits<T>::lowest()` → return
`nullopt`) to match the new sentinels. A genuine all-`+Inf` column has `min ==
+Inf, max == +Inf`, which stays distinguishable from the empty/all-NaN case
(`min == +Inf, max == -Inf`).
Confirmed present on `main` (the relevant code is unchanged from 24.0.0).
### Component(s)
C++, Parquet
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]