Copilot commented on code in PR #50183:
URL: https://github.com/apache/arrow/pull/50183#discussion_r3415386218


##########
cpp/src/parquet/statistics.cc:
##########
@@ -357,7 +383,10 @@ CleanStatistic(std::pair<T, T> min_max, 
LogicalType::Type::type) {
     return ::std::nullopt;
   }
 
-  if (min == std::numeric_limits<T>::max() && max == 
std::numeric_limits<T>::lowest()) {
+  // No valid (non-NaN) value was seen: the running min/max still hold the
+  // +/-infinity seeds (note min > max here, which real data can never 
produce).
+  if (min == std::numeric_limits<T>::infinity() &&
+      max == -std::numeric_limits<T>::infinity()) {
     return ::std::nullopt;
   }

Review Comment:
   This sentinel check is now incorrect for non-floating-point instantiations 
of `CleanStatistic<T>`: for integral types, 
`std::numeric_limits<T>::infinity()` yields `T{0}`, so a real column with all 
zeros (min==0 && max==0) would be incorrectly treated as 'no valid values' and 
return nullopt. Fix by making this check conditional (e.g., `if constexpr 
(std::is_floating_point_v<T>)` use ±infinity; else keep the previous 
`max()`/`lowest()` sentinel check), or compare against the same sentinel values 
used to seed min/max for that `T`.



##########
cpp/src/parquet/statistics.cc:
##########
@@ -79,8 +89,24 @@ struct CompareHelper {
   static_assert(!std::is_unsigned<T>::value || std::is_same<T, bool>::value,
                 "T is an unsigned numeric");
 
-  constexpr static T DefaultMin() { return std::numeric_limits<T>::max(); }
-  constexpr static T DefaultMax() { return std::numeric_limits<T>::lowest(); }
+  // For floating point, seed the running min/max with +/-infinity rather than
+  // the largest/smallest finite value: a finite seed is not an identity for
+  // min/max once the data contains infinities, so an all-+Inf (resp. all--Inf)
+  // column would otherwise never displace the seed and report a finite bound.
+  constexpr static T DefaultMin() {
+    if constexpr (std::is_floating_point_v<T>) {
+      return std::numeric_limits<T>::infinity();
+    } else {
+      return std::numeric_limits<T>::max();
+    }
+  }
+  constexpr static T DefaultMax() {
+    if constexpr (std::is_floating_point_v<T>) {
+      return -std::numeric_limits<T>::infinity();
+    } else {
+      return std::numeric_limits<T>::lowest();
+    }
+  }

Review Comment:
   The 'default seed' logic is now split between 
`CompareHelper::DefaultMin/DefaultMax` and 
`CleanStatistic`/`CleanFloat16Statistic` sentinel checks. To prevent future 
drift (as happened here), consider centralizing the sentinel comparison by 
reusing the same helper (e.g., basing cleaning on the same conditional as 
DefaultMin/DefaultMax, or factoring out a small `IsUninitializedMinMax(min, 
max)` helper).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to