cyb70289 opened a new pull request #9635:
URL: https://github.com/apache/arrow/pull/9635


   Use pairwise summation to reduce round-off error from O(n) to O(log n).
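
To illustrate the error bounds mentioned above, here is a minimal sketch in plain Python (not the Arrow C++ kernel in this patch). The function names and the block size of 16 are assumptions chosen for illustration only. `math.fsum` serves as a correctly rounded reference.

```python
import math


def naive_sum(values):
    """Left-to-right accumulation; worst-case rounding error grows as O(n)."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values, lo=0, hi=None, block=16):
    """Recursive halving; worst-case rounding error grows as O(log n)."""
    if hi is None:
        hi = len(values)
    if hi - lo <= block:
        # Small blocks are summed naively to keep recursion overhead low.
        total = 0.0
        for i in range(lo, hi):
            total += values[i]
        return total
    mid = lo + (hi - lo) // 2
    return pairwise_sum(values, lo, mid, block) + pairwise_sum(values, mid, hi, block)


data = [0.1] * 1_000_000
exact = math.fsum(data)  # correctly rounded reference sum
naive_err = abs(naive_sum(data) - exact)
pairwise_err = abs(pairwise_sum(data) - exact)
```

Comparing `naive_err` with `pairwise_err` shows how much closer the pairwise result stays to the reference on this input.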
   
   **NOTE:** This patch hurts sum kernel performance significantly.
   I am not too worried, as the resulting performance is on par with NumPy.
   
   For floating point, a drop of up to 75% is observed. This is because the
   old code manually unrolls loops, which greatly improves performance. But
   manual unrolling is something that should be avoided here. Due to limited
   precision, basic math rules do not apply to floating-point numbers, e.g.
   `(a+b)+c != a+(b+c)` in general. Tests show that the SSE4 and AVX2
   summation kernels may give different results (both wrong), simply
   because they use different unroll steps. [1]
   I guess this is also the reason why compilers only unroll loops for
   integers, but not for floating-point numbers.
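
The non-associativity point can be checked with a short sketch. The Python below is only a conceptual model: `unrolled_sum` is a hypothetical helper that mimics a loop unrolled into `lanes` independent accumulators, which is roughly how unroll steps of different widths regroup the additions. It is not the SSE4 or AVX2 kernel code.

```python
import math

# Floating-point addition is not associative: grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)


def unrolled_sum(values, lanes):
    """Keep one partial sum per lane, then combine the lanes at the end."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total


data = [1.0 / (i + 1) for i in range(100_000)]
sum_4 = unrolled_sum(data, 4)   # models a 4-way unroll
sum_8 = unrolled_sum(data, 8)   # models an 8-way unroll
reference = math.fsum(data)     # correctly rounded reference sum
```

Both `sum_4` and `sum_8` are close to `reference`, but because the additions are grouped differently, their last bits may disagree with each other and with the reference.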
   
   For integers, depending on the compiler, a drop of up to 50% is observed
   for int32/int64, and an even bigger gap for int8/int16. Per my
   preliminary investigation, this is because I replaced BitBlockCounter
   with VisitSetBitRunsVoid. For some reason, VisitSetBitRunsVoid prevents
   the compiler from generating vectorized code, even for the non-null
   case. As VisitSetBitRunsVoid is much easier and more natural to use, I
   prefer to keep it for now and leave further optimization as a follow-up
   task.
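
For readers unfamiliar with run-based visiting, the idea can be sketched conceptually as follows. This is plain Python, not the Arrow C++ API: `set_bit_runs` and `sum_valid` are hypothetical names, and the validity bitmap is modeled as a list of booleans. The sketch only shows why a callback per run of valid values is natural to use for summation.

```python
def set_bit_runs(validity):
    """Yield (start, length) for each maximal run of valid (True) slots."""
    start = None
    for i, valid in enumerate(validity):
        if valid and start is None:
            start = i
        elif not valid and start is not None:
            yield start, i - start
            start = None
    if start is not None:
        yield start, len(validity) - start


def sum_valid(values, validity):
    """Sum only the valid values, one contiguous run at a time."""
    total = 0
    for start, length in set_bit_runs(validity):
        # Each run is a dense slice with no nulls inside it.
        for i in range(start, start + length):
            total += values[i]
    return total


values = [1, 2, 3, 4, 5, 6]
validity = [True, True, False, True, False, True]
runs = list(set_bit_runs(validity))
total = sum_valid(values, validity)
```

With no nulls at all, the whole array is a single run, so the inner loop is the only hot path.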
   
   [1] https://issues.apache.org/jira/browse/ARROW-11758



