cyb70289 opened a new pull request #9635: URL: https://github.com/apache/arrow/pull/9635
Use pairwise summation to reduce round-off error growth from O(n) to O(log n).

**NOTE:** This patch hurts sum kernel performance significantly. I am not too worried, as the resulting performance is on par with NumPy.

For floating point, a drop of up to 75% is observed. This is because the old code manually unrolls loops, which greatly improves performance. But this is something that should be avoided. Because of limited precision, basic arithmetic rules such as associativity do not hold for floating-point numbers. E.g., `(a+b)+c != a+(b+c)`. Tests show that the SSE4 and AVX2 summation kernels may give different results (both wrong), simply because they use different unroll steps. [1] I guess this is also the reason why compilers unroll loops only for integers, not for floating point.

For integers, depending on the compiler, a drop of up to 50% is observed for int32/int64, and an even bigger gap for int8/int16. Per my preliminary investigation, this is because I replaced BitBlockCounter with VisitSetBitRunsVoid. For some reason, VisitSetBitRunsVoid prevents the compiler from generating vectorized code, even in the non-null case. As VisitSetBitRunsVoid is much easier and more natural to use, I prefer keeping it for now and leaving further optimization as a follow-up task.

[1] https://issues.apache.org/jira/browse/ARROW-11758
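
To illustrate the precision argument, here is a minimal Python sketch of pairwise summation compared with a plain left-to-right loop. It is not the code in this patch: the function names `naive_sum` and `pairwise_sum` and the block size of 16 are hypothetical and chosen only for illustration, and `math.fsum` is used as an accurately rounded reference.

```python
import math


def naive_sum(values):
    """Left-to-right accumulation; round-off error can grow linearly with n."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values, lo=0, hi=None, block=16):
    """Sum values[lo:hi] by recursively splitting the range in half.

    Ranges of at most `block` elements are summed with a plain loop.
    The block size is an illustrative assumption, not a tuned value.
    """
    if hi is None:
        hi = len(values)
    n = hi - lo
    if n <= block:
        total = 0.0
        for i in range(lo, hi):
            total += values[i]
        return total
    mid = lo + n // 2
    # Each element passes through roughly log2(n / block) extra additions,
    # which is why the error growth is logarithmic rather than linear.
    return pairwise_sum(values, lo, mid, block) + pairwise_sum(values, mid, hi, block)


if __name__ == "__main__":
    # Floating-point addition is not associative.
    print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

    data = [0.1] * 1_000_000
    reference = math.fsum(data)
    print("naive error:   ", abs(naive_sum(data) - reference))
    print("pairwise error:", abs(pairwise_sum(data) - reference))
```

The grouping of additions determines the result, which is also why kernels with different unroll steps can disagree: each unroll factor corresponds to a different parenthesization of the same sum.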