khwilson commented on PR #44184:
URL: https://github.com/apache/arrow/pull/44184#issuecomment-2414833049

   Two problems with just validating afterward: First, I'd expect validation to fail in perfectly reasonable cases: summing 1M decimals of approximately the same magnitude needs roughly 6 more digits of precision than the inputs have. I assume this is why all the DBMSs I looked at increase the precision of a decimal sum by default.
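   
   For concreteness, a minimal sketch of that digit-count arithmetic (plain Python, nothing Arrow-specific):
   
   ```python
   import math
   
   # summing n same-sign values of roughly equal magnitude can grow the
   # total by a factor of up to n, i.e. ceil(log10(n)) extra decimal digits
   n = 1_000_000
   print(math.ceil(math.log10(n)))  # 6
   ```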
   
   Second, just checking for overflow doesn't solve the underlying problem. 
Consider:
   
   ```python
   import pyarrow as pa
   import pyarrow.compute as pc
   
   a = pa.array([789.3] * 18).cast(pa.decimal128(38, 35))
   print(pc.sum(a))  # the 128-bit representation wrapped around mid-sum
   pc.sum(a).validate(full=True)  # passes anyway
   ```
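   
   To see how wrong that "valid" result is, here's a minimal sketch comparing it against an exact reference sum (it assumes the `a` from above; `to_pylist()` yields Python `Decimal`s, and the context precision is raised so the reference sum doesn't round):
   
   ```python
   from decimal import Decimal, getcontext
   
   getcontext().prec = 50  # enough digits for an exact reference total
   exact = sum(a.to_pylist(), Decimal(0))
   print(exact)       # the mathematically correct total
   print(pc.sum(a))   # the wrapped-around value that still validates
   ```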
   
   DuckDB, by contrast, implements an intermediate check so that the internal overflow surfaces as an error:
   
   ```python
   import duckdb
   
   tab = pa.Table.from_pydict({"a": a})
   duckdb.query("select sum(a) from tab")
   # Traceback (most recent call last):
   #   File "<stdin>", line 1, in <module>
   # duckdb.duckdb.OutOfRangeException: Out of Range Error: Overflow in HUGEINT addition:
   # 157859999999999990905052982270717620880 + 78929999999999995452526491135358810440
   ```
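   
   The check DuckDB is doing amounts to verifying that every partial sum still fits in the 128-bit storage. A rough Python sketch over unscaled decimal values (the function and names here are mine, not DuckDB's):
   
   ```python
   INT128_MAX = (1 << 127) - 1
   INT128_MIN = -(1 << 127)
   
   def sum_with_intermediate_check(unscaled_values):
       # accumulate in Python's arbitrary-precision ints, but raise as soon
       # as a partial sum no longer fits in a signed 128-bit integer
       total = 0
       for v in unscaled_values:
           if not INT128_MIN <= total + v <= INT128_MAX:
               raise OverflowError(f"overflow in 128-bit addition: {total} + {v}")
           total += v
       return total
   ```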
   
   Notably, this lack of overflow checking also applies to integer sums in Arrow:
   
   ```python
   >>> pa.array([9223372036854775800] * 2, type=pa.int64())
   <pyarrow.lib.Int64Array object at 0x10c1d8b80>
   [
     9223372036854775800,
     9223372036854775800
   ]
   >>> pc.sum(pa.array([9223372036854775800] * 2, type=pa.int64()))
   <pyarrow.Int64Scalar: -16>
   >>> pc.sum(pa.array([9223372036854775800] * 2, type=pa.int64())).validate(full=True)  # passes silently
   ```
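   
   For what it's worth, Arrow does ship checked element-wise kernels, so a caller can get the failing behavior today at the cost of doing the reduction by hand. A minimal sketch with `pc.add_checked` (the scalar fold is deliberately naive, just to show the checked kernel raising where `pc.sum` wraps):
   
   ```python
   big = pa.array([9223372036854775800] * 2, type=pa.int64())
   try:
       pc.add_checked(big[0], big[1])  # checked variant raises instead of wrapping
   except pa.ArrowInvalid as exc:
       print(exc)  # an overflow error, unlike pc.sum's silent -16
   ```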

