Jefffrey commented on code in PR #20208:
URL: https://github.com/apache/datafusion/pull/20208#discussion_r2787966144
##########
datafusion/functions-aggregate/src/percentile_cont.rs:
##########
@@ -58,16 +58,10 @@ use datafusion_macros::user_doc;
use crate::utils::validate_percentile_expr;
-/// Precision multiplier for linear interpolation calculations.
-///
-/// This value of 1,000,000 was chosen to balance precision with overflow
safety:
-/// - Provides 6 decimal places of precision for the fractional component
-/// - Small enough to avoid overflow when multiplied with typical numeric
values
-/// - Sufficient precision for most statistical applications
-///
-/// The interpolation formula: `lower + (upper - lower) * fraction`
-/// is computed as: `lower + ((upper - lower) * (fraction * PRECISION)) /
PRECISION`
-/// to avoid floating-point operations on integer types while maintaining
precision.
+// Keep historical behavior for percentile interpolation by quantizing the
+// fractional component before interpolation. This minimizes output changes for
+// Float32/Float64 while still allowing us to perform the arithmetic in f64
+// (preventing Float16 overflow).
Review Comment:
Can we keep the original comment here? I don't see why we should replace the
proper explanation with simply "Keep historical behaviour"
##########
datafusion/functions-aggregate/src/percentile_cont.rs:
##########
@@ -772,22 +783,42 @@ fn calculate_percentile<T: ArrowNumericType>(
let (_, upper_value, _) =
values.select_nth_unstable_by(upper_index, cmp);
let upper_value = *upper_value;
- // Linear interpolation using wrapping arithmetic
- // We use wrapping operations here (matching the approach in
median.rs) because:
- // 1. Both values come from the input data, so diff is bounded by
the value range
- // 2. fraction is between 0 and 1, and INTERPOLATION_PRECISION is
small enough
- // to prevent overflow when combined with typical numeric ranges
- // 3. The result is guaranteed to be between lower_value and
upper_value
- // 4. For floating-point types, wrapping ops behave the same as
standard ops
+ // Linear interpolation (see `INTERPOLATION_PRECISION` comment for
rationale).
let fraction = index - (lower_index as f64);
- let diff = upper_value.sub_wrapping(lower_value);
- let interpolated = lower_value.add_wrapping(
- diff.mul_wrapping(T::Native::usize_as(
- (fraction * INTERPOLATION_PRECISION as f64) as usize,
- ))
- .div_wrapping(T::Native::usize_as(INTERPOLATION_PRECISION)),
- );
- Some(interpolated)
+ let scaled = (fraction * INTERPOLATION_PRECISION as f64) as usize;
Review Comment:
Why are we removing the wrapping versions of the operators?
##########
datafusion/functions-aggregate/src/percentile_cont.rs:
##########
@@ -772,22 +783,42 @@ fn calculate_percentile<T: ArrowNumericType>(
let (_, upper_value, _) =
values.select_nth_unstable_by(upper_index, cmp);
let upper_value = *upper_value;
- // Linear interpolation using wrapping arithmetic
- // We use wrapping operations here (matching the approach in
median.rs) because:
- // 1. Both values come from the input data, so diff is bounded by
the value range
- // 2. fraction is between 0 and 1, and INTERPOLATION_PRECISION is
small enough
Review Comment:
Same here
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]