[
https://issues.apache.org/jira/browse/ARROW-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302281#comment-17302281
]
Yibo Cai commented on ARROW-11758:
----------------------------------
For record.
Arrow's pairwise sum implementation is a bit different from numpy. Numpy
adopts recursion with manually unrolled loops for last 16 summations. Arrow
code is non-recursive, and no unroll. Though equivalent in theory, they are
different for floating point arithmetic in reality. There may be gaps for large
array summations.
Actually, neither numpy nor pyarrow can give accurate results. The point is
they both guarantees O(logn) round-off error bound.
In below test, pyarrow gives the accurate result. Of course, there must be some
tests in favor of numpy. The truth value is from mpmath arbitrary precision lib.
*Test code*
{code:python}
import numpy as np
import pyarrow.compute as pc
import mpmath
mpmath.dps = 30
t = np.arange(123456, 654321, dtype='float64') * 987654321
print(' numpy: %.0f' % np.sum(t))
print(' arrow: %.0f' % pc.sum(t).as_py())
print('mpmath: %.0f' % mpmath.fsum(t))
{code}
*Result*
{noformat}
numpy: 203898299380326662144
arrow: 203898299380326498304
mpmath: 203898299380326498304
{noformat}
> [C++][Compute] Summation kernel round-off error
> -----------------------------------------------
>
> Key: ARROW-11758
> URL: https://issues.apache.org/jira/browse/ARROW-11758
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Reporter: Yibo Cai
> Assignee: Yibo Cai
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 3.5h
> Remaining Estimate: 0h
>
> From below test, summation kernel is of lower precision than numpy.sum.
> Numpy implements pairwise summation [1] with O(logn) round-off error, better
> than O(n\) error from naive summation.
> *sum.py*
> {code:python}
> import numpy as np
> import pyarrow.compute as pc
> t = np.arange(321000, dtype='float64')
> t2 = t - np.mean(t)
> t2 *= t2
> print('numpy sum:', np.sum(t2))
> print('arrow sum:', pc.sum(t2))
> {code}
> *test result*
> {noformat}
> # Verified with wolfram alpha (arbitrary precision), Numpy's result is
> correct.
> $ ARROW_USER_SIMD_LEVEL=SSE4_2 python sum.py
> numpy sum: 2756346749973250.0
> arrow sum: 2756346749973248.0
> $ ARROW_USER_SIMD_LEVEL=AVX2 python sum.py
> numpy sum: 2756346749973250.0
> arrow sum: 2756346749973249.0
> {noformat}
> [1] https://en.wikipedia.org/wiki/Pairwise_summation
--
This message was sent by Atlassian Jira
(v8.3.4#803005)