clee704 commented on code in PR #50183:
URL: https://github.com/apache/arrow/pull/50183#discussion_r3415458864


##########
cpp/src/parquet/statistics.cc:
##########
@@ -357,7 +383,10 @@ CleanStatistic(std::pair<T, T> min_max, 
LogicalType::Type::type) {
     return ::std::nullopt;
   }
 
-  if (min == std::numeric_limits<T>::max() && max == 
std::numeric_limits<T>::lowest()) {
+  // No valid (non-NaN) value was seen: the running min/max still hold the
+  // +/-infinity seeds (note min > max here, which real data can never 
produce).
+  if (min == std::numeric_limits<T>::infinity() &&
+      max == -std::numeric_limits<T>::infinity()) {
     return ::std::nullopt;
   }

Review Comment:
   This overload of `CleanStatistic` is SFINAE-constrained to floating-point 
`T` (`enable_if_t<std::is_floating_point<T>::value>`), so it is never 
instantiated for integral types — the integral overload just above returns 
`min_max` unchanged, and an all-zeros integer column never reaches this check. 
That said, I have removed the hardcoded infinity sentinel entirely: the 
detection now uses a `IsUninitializedMinMax(min, max)` helper (`max < min`) 
that depends only on the `DefaultMin > DefaultMax` invariant rather than any 
specific seed value.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -79,8 +89,24 @@ struct CompareHelper {
   static_assert(!std::is_unsigned<T>::value || std::is_same<T, bool>::value,
                 "T is an unsigned numeric");
 
-  constexpr static T DefaultMin() { return std::numeric_limits<T>::max(); }
-  constexpr static T DefaultMax() { return std::numeric_limits<T>::lowest(); }
+  // For floating point, seed the running min/max with +/-infinity rather than
+  // the largest/smallest finite value: a finite seed is not an identity for
+  // min/max once the data contains infinities, so an all-+Inf (resp. all--Inf)
+  // column would otherwise never displace the seed and report a finite bound.
+  constexpr static T DefaultMin() {
+    if constexpr (std::is_floating_point_v<T>) {
+      return std::numeric_limits<T>::infinity();
+    } else {
+      return std::numeric_limits<T>::max();
+    }
+  }
+  constexpr static T DefaultMax() {
+    if constexpr (std::is_floating_point_v<T>) {
+      return -std::numeric_limits<T>::infinity();
+    } else {
+      return std::numeric_limits<T>::lowest();
+    }
+  }

Review Comment:
   Good call — factored the detection into a single `IsUninitializedMinMax(min, 
max)` helper reused by both the float and Float16 cleaning paths. It returns 
`max < min`, which holds only for the uninitialized seeds (real, non-NaN data 
always has `min <= max`), so the seeds (`DefaultMin`/`DefaultMax`) and their 
detection can no longer drift apart.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to