kosiew commented on code in PR #22714:
URL: https://github.com/apache/datafusion/pull/22714#discussion_r3361221264


##########
datafusion/functions-aggregate/src/average.rs:
##########
@@ -223,50 +233,48 @@ impl AggregateUDFImpl for Avg {
             match (&data_type, acc_args.return_type()) {
                 (Float64, Float64) => Ok(Box::<AvgAccumulator>::default()),
                 (
-                    Decimal32(sum_precision, sum_scale),
+                    Decimal32(_sum_precision, sum_scale),
                     Decimal32(target_precision, target_scale),
-                ) => Ok(Box::new(DecimalAvgAccumulator::<Decimal32Type> {
-                    sum: None,
-                    count: 0,
-                    sum_scale: *sum_scale,
-                    sum_precision: *sum_precision,
-                    target_precision: *target_precision,
-                    target_scale: *target_scale,
-                })),
+                ) => Ok(Box::new(DecimalAvgAccumulator::<
+                    Decimal32Type,
+                    Decimal64Type,
+                >::new(
+                    *sum_scale,
+                    DECIMAL64_MAX_PRECISION,
+                    *target_precision,
+                    *target_scale,
+                ))),
                 (
-                    Decimal64(sum_precision, sum_scale),
+                    Decimal64(_sum_precision, sum_scale),
                     Decimal64(target_precision, target_scale),
-                ) => Ok(Box::new(DecimalAvgAccumulator::<Decimal64Type> {
-                    sum: None,
-                    count: 0,
-                    sum_scale: *sum_scale,
-                    sum_precision: *sum_precision,
-                    target_precision: *target_precision,
-                    target_scale: *target_scale,
-                })),
+                ) => Ok(Box::new(DecimalAvgAccumulator::<
+                    Decimal64Type,
+                    Decimal128Type,
+                >::new(
+                    *sum_scale,
+                    DECIMAL128_MAX_PRECISION,
+                    *target_precision,
+                    *target_scale,
+                ))),
                 (
                     Decimal128(sum_precision, sum_scale),
                     Decimal128(target_precision, target_scale),
-                ) => Ok(Box::new(DecimalAvgAccumulator::<Decimal128Type> {
-                    sum: None,
-                    count: 0,
-                    sum_scale: *sum_scale,
-                    sum_precision: *sum_precision,
-                    target_precision: *target_precision,
-                    target_scale: *target_scale,
-                })),
+                ) => Ok(Box::new(DecimalAvgAccumulator::<Decimal128Type>::new(

Review Comment:
   I think there is still a correctness issue here for Decimal128 AVG. The 
accumulator state remains `Decimal128Type`, and the new summation path uses 
`add_wrapping` (for example at lines 651, 672, and 947). That means 
intermediate overflow can still silently wrap before `DecimalAverager` gets a 
chance to run.
   
   For example, `avg(arrow_cast('999999999999999999999999999999.9999', 
'Decimal128(34, 4)'))` over roughly 20,000 rows has a valid `Decimal128(38, 8)` 
result, but the intermediate Decimal128 sum exceeds `i128::MAX` and wraps along 
the way. At that point the average is already corrupted even though the final 
result would be representable.
   
   Could we widen the Decimal128 accumulation state (for example to Decimal256) 
or otherwise use checked/compensated accumulation so intermediate overflow does 
not invalidate averages whose final result fits?



##########
datafusion/sqllogictest/src/engines/datafusion_engine/normalize.rs:
##########
@@ -271,6 +271,8 @@ pub fn convert_schema_to_types(columns: &Fields) -> 
Vec<DFColumnType> {
             DataType::Float16
             | DataType::Float32
             | DataType::Float64
+            | DataType::Decimal32(_, _)

Review Comment:
   Small suggestion: mapping Decimal32/Decimal64 globally to 
`DFColumnType::Float` makes SLT comparisons approximate for these decimal types 
and could hide formatting or rounding regressions that would otherwise be 
caught.
   
   If the motivation is only the affected `power`/`round` queries, would it 
make sense to keep exact/text comparisons there instead? At minimum, it may be 
worth documenting why all Decimal32/64 SLT output is now treated approximately.



##########
datafusion/functions-aggregate/src/average.rs:
##########
@@ -365,17 +370,27 @@ impl AggregateUDFImpl for Avg {
                 Decimal32(_sum_precision, sum_scale),

Review Comment:
   Small refactoring suggestion: the Decimal32 and Decimal64 branches both 
follow the same pattern of creating a wider `DecimalAverager`, dividing by a 
widened count, and then `try_from`-ing back to the output native type.
   
   It might be nice to factor that into a helper such as 
`avg_decimal_with_wider_sum`. That would encode the widening invariant in one 
place and help keep the accumulator and group-accumulator paths aligned over 
time.



##########
datafusion/functions-aggregate/src/average.rs:
##########


Review Comment:
   I think the same issue still exists for `AVG(DISTINCT)` on decimals. The 
distinct path constructs `DecimalDistinctAvgAccumulator::<Decimal32Type>`, 
`Decimal64Type`, and `Decimal128Type`, and those accumulators still sum 
distinct values in the native type using wrapping arithmetic.
   
   `AVG(DISTINCT)` is still an average, so it can hit the same intermediate 
overflow problem whenever the average is representable but the sum is not. 
Could the widening/state-type fix be applied here as well? Otherwise it would 
be good to explicitly narrow the supported contract and add tests that document 
the unsupported behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to