Jefffrey commented on code in PR #20208:
URL: https://github.com/apache/datafusion/pull/20208#discussion_r2787966144


##########
datafusion/functions-aggregate/src/percentile_cont.rs:
##########
@@ -58,16 +58,10 @@ use datafusion_macros::user_doc;
 
 use crate::utils::validate_percentile_expr;
 
-/// Precision multiplier for linear interpolation calculations.
-///
-/// This value of 1,000,000 was chosen to balance precision with overflow 
safety:
-/// - Provides 6 decimal places of precision for the fractional component
-/// - Small enough to avoid overflow when multiplied with typical numeric 
values
-/// - Sufficient precision for most statistical applications
-///
-/// The interpolation formula: `lower + (upper - lower) * fraction`
-/// is computed as: `lower + ((upper - lower) * (fraction * PRECISION)) / 
PRECISION`
-/// to avoid floating-point operations on integer types while maintaining 
precision.
+// Keep historical behavior for percentile interpolation by quantizing the
+// fractional component before interpolation. This minimizes output changes for
+// Float32/Float64 while still allowing us to perform the arithmetic in f64
+// (preventing Float16 overflow).

Review Comment:
   Can we keep the original comment here? I don't see why we should replace the 
proper explanation with simply "Keep historical behaviour"



##########
datafusion/functions-aggregate/src/percentile_cont.rs:
##########
@@ -772,22 +783,42 @@ fn calculate_percentile<T: ArrowNumericType>(
             let (_, upper_value, _) = 
values.select_nth_unstable_by(upper_index, cmp);
             let upper_value = *upper_value;
 
-            // Linear interpolation using wrapping arithmetic
-            // We use wrapping operations here (matching the approach in 
median.rs) because:
-            // 1. Both values come from the input data, so diff is bounded by 
the value range
-            // 2. fraction is between 0 and 1, and INTERPOLATION_PRECISION is 
small enough
-            //    to prevent overflow when combined with typical numeric ranges
-            // 3. The result is guaranteed to be between lower_value and 
upper_value
-            // 4. For floating-point types, wrapping ops behave the same as 
standard ops
+            // Linear interpolation (see `INTERPOLATION_PRECISION` comment for 
rationale).
             let fraction = index - (lower_index as f64);
-            let diff = upper_value.sub_wrapping(lower_value);
-            let interpolated = lower_value.add_wrapping(
-                diff.mul_wrapping(T::Native::usize_as(
-                    (fraction * INTERPOLATION_PRECISION as f64) as usize,
-                ))
-                .div_wrapping(T::Native::usize_as(INTERPOLATION_PRECISION)),
-            );
-            Some(interpolated)
+            let scaled = (fraction * INTERPOLATION_PRECISION as f64) as usize;

Review Comment:
   Why are we removing the wrapping versions of the operators?



##########
datafusion/functions-aggregate/src/percentile_cont.rs:
##########
@@ -772,22 +783,42 @@ fn calculate_percentile<T: ArrowNumericType>(
             let (_, upper_value, _) = 
values.select_nth_unstable_by(upper_index, cmp);
             let upper_value = *upper_value;
 
-            // Linear interpolation using wrapping arithmetic
-            // We use wrapping operations here (matching the approach in 
median.rs) because:
-            // 1. Both values come from the input data, so diff is bounded by 
the value range
-            // 2. fraction is between 0 and 1, and INTERPOLATION_PRECISION is 
small enough

Review Comment:
   Same here



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to