[ https://issues.apache.org/jira/browse/ARROW-9842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308410#comment-17308410 ]

Yibo Cai commented on ARROW-9842:
---------------------------------

[GenerateBitsUnrolled|https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bitmap_generate.h#L63]
 is called to do the actual work. A hand-written _mm_movemask_epi8 version
does improve compare kernel performance significantly for integers. A quick
POC patch is attached: [^movemask.patch]

*NOTE:* this patch only improves operations with a trivial generator (integer
compare), which is easily vectorizable. For non-trivial generators (string
compare), it hurts performance, and the GenerateBits benchmark itself also
regresses. My gut feeling is that it's better to tweak the C++ code and let
the compiler do the dirty job.
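
To make the movemask idea concrete, here is a minimal sketch. It is not the attached patch: the name GreaterInt32ToBitmap is hypothetical, it compares int32 rather than int64 so that only SSE2 intrinsics are needed, and it assumes the LSB-first bit order used by Arrow bitmaps. Without SSE2 it falls back to the scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: bit i of `out` (LSB-first) is set iff left[i] > right[i].
// `out` must hold at least (n + 7) / 8 bytes; unused trailing bits are zeroed.
void GreaterInt32ToBitmap(const int32_t* left, const int32_t* right, size_t n,
                          uint8_t* out) {
  assert(left != nullptr && right != nullptr && out != nullptr);
  size_t i = 0;
#if defined(__SSE2__)
  // Process 16 values per iteration, producing 2 output bytes.
  for (; i + 16 <= n; i += 16) {
    // Compare 4 int32 lanes; each lane becomes all-ones (true) or zero (false).
    auto cmp4 = [&](size_t off) {
      __m128i l = _mm_loadu_si128(reinterpret_cast<const __m128i*>(left + i + off));
      __m128i r = _mm_loadu_si128(reinterpret_cast<const __m128i*>(right + i + off));
      return _mm_cmpgt_epi32(l, r);
    };
    // Narrow 32-bit masks to 16-bit, then to 8-bit; signed saturation keeps
    // -1 as -1 and 0 as 0, and lane order is preserved.
    __m128i lo = _mm_packs_epi32(cmp4(0), cmp4(4));
    __m128i hi = _mm_packs_epi32(cmp4(8), cmp4(12));
    __m128i bytes = _mm_packs_epi16(lo, hi);
    // Collect the top bit of each of the 16 bytes into a 16-bit mask.
    unsigned mask = static_cast<unsigned>(_mm_movemask_epi8(bytes));
    out[i / 8] = static_cast<uint8_t>(mask & 0xFF);
    out[i / 8 + 1] = static_cast<uint8_t>(mask >> 8);
  }
#endif
  // Scalar tail (and full fallback when SSE2 is unavailable).
  for (; i < n; ++i) {
    if (i % 8 == 0) out[i / 8] = 0;
    if (left[i] > right[i]) out[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
  }
}
```

This only works because the comparison itself is a single vector instruction; a generator that cannot be expressed that way (such as a string compare) gets no such benefit, which is consistent with the numbers below.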

{noformat}
$ archery benchmark diff --suite-filter="arrow-compute-scalar-compare-benchmark" --cc=clang-9 --cxx=clang++-9

-----------------------------------------------------------------------------------------------
Non-regressions: (17)
-----------------------------------------------------------------------------------------------
                           benchmark            baseline           contender  change %  counters
    GreaterArrayArrayInt64/32768/100    1.889G items/sec    3.119G items/sec    65.101        {}
      GreaterArrayArrayInt64/32768/0    1.942G items/sec    3.175G items/sec    63.429        {}
  GreaterArrayArrayInt64/32768/10000    1.901G items/sec    3.102G items/sec    63.169        {}
     GreaterArrayArrayInt64/32768/10    1.916G items/sec    3.126G items/sec    63.147        {}
      GreaterArrayArrayInt64/32768/2    1.909G items/sec    3.104G items/sec    62.651        {}
      GreaterArrayArrayInt64/32768/1    1.923G items/sec    3.128G items/sec    62.630        {}
    GreaterArrayScalarInt64/32768/10    2.715G items/sec    3.206G items/sec    18.079        {}
 GreaterArrayScalarInt64/32768/10000    2.700G items/sec    3.176G items/sec    17.631        {}
     GreaterArrayScalarInt64/32768/1    2.744G items/sec    3.194G items/sec    16.386        {}
   GreaterArrayScalarInt64/32768/100    2.733G items/sec    3.176G items/sec    16.192        {}
     GreaterArrayScalarInt64/32768/2    2.746G items/sec    3.184G items/sec    15.952        {}
     GreaterArrayScalarInt64/32768/0    2.772G items/sec    3.208G items/sec    15.719        {}
 GreaterArrayArrayString/32768/10000  115.583M items/sec  115.174M items/sec    -0.354        {}
     GreaterArrayArrayString/32768/0  116.099M items/sec  115.286M items/sec    -0.700        {}
   GreaterArrayArrayString/32768/100  115.595M items/sec  113.561M items/sec    -1.760        {}
    GreaterArrayScalarString/32768/2  210.101M items/sec  204.424M items/sec    -2.702        {}
    GreaterArrayArrayString/32768/10  117.269M items/sec  113.311M items/sec    -3.375        {}

------------------------------------------------------------------------------------------------
Regressions: (7)
------------------------------------------------------------------------------------------------
                           benchmark            baseline           contender  change %  counters
     GreaterArrayArrayString/32768/2  136.027M items/sec  126.051M items/sec    -7.334        {}
    GreaterArrayScalarString/32768/0  603.706M items/sec  524.915M items/sec   -13.051        {}
GreaterArrayScalarString/32768/10000  603.305M items/sec  521.166M items/sec   -13.615        {}
  GreaterArrayScalarString/32768/100  580.450M items/sec  497.031M items/sec   -14.371        {}
   GreaterArrayScalarString/32768/10  437.081M items/sec  368.205M items/sec   -15.758        {}
    GreaterArrayScalarString/32768/1  812.923M items/sec  675.911M items/sec   -16.854        {}
     GreaterArrayArrayString/32768/1  613.903M items/sec  441.708M items/sec   -28.049        {}

{noformat}


> [C++] Explore alternative strategy for Compare kernel implementation for 
> better performance
> -------------------------------------------------------------------------------------------
>
>                 Key: ARROW-9842
>                 URL: https://issues.apache.org/jira/browse/ARROW-9842
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 5.0.0
>
>         Attachments: movemask.patch
>
>
> The compiler may be able to vectorize comparison operations if the bitpacking
> of results is deferred until the end (or done in chunks). Instead, a temporary
> bytemap can be populated on a chunk-by-chunk basis and then the bytemaps can
> be bitpacked into the output buffer. This may also reduce the code size of
> the compare kernels (which are actually quite large at the moment).
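
The chunked bytemap strategy described in the issue can be sketched roughly as follows. This is an illustration only, not Arrow code: the function name GreaterInt64ViaBytemap and the chunk size kChunk are hypothetical, and LSB-first bit order is assumed. Whether the compiler actually vectorizes the first loop depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical chunk size; must be a multiple of 8 so that every chunk
// starts on an output byte boundary.
constexpr size_t kChunk = 1024;

// Hypothetical helper: bit i of `out` (LSB-first) is set iff left[i] > right[i].
// `out` must hold at least (n + 7) / 8 bytes; unused trailing bits are zeroed.
void GreaterInt64ViaBytemap(const int64_t* left, const int64_t* right, size_t n,
                            uint8_t* out) {
  assert(left != nullptr && right != nullptr && out != nullptr);
  uint8_t bytemap[kChunk];
  for (size_t start = 0; start < n; start += kChunk) {
    const size_t len = (n - start < kChunk) ? (n - start) : kChunk;

    // Step 1: one byte (0 or 1) per comparison result. The loop is
    // branch-free with no cross-iteration dependency, which makes it a
    // plausible candidate for auto-vectorization.
    for (size_t i = 0; i < len; ++i) {
      bytemap[i] = static_cast<uint8_t>(left[start + i] > right[start + i]);
    }

    // Step 2: bitpack the bytemap into the output buffer, 8 results per byte.
    uint8_t* dst = out + start / 8;
    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
      uint8_t b = 0;
      for (int k = 0; k < 8; ++k) {
        b |= static_cast<uint8_t>(bytemap[i + k] << k);
      }
      dst[i / 8] = b;
    }
    // Partial final byte (only possible in the last chunk).
    if (i < len) {
      uint8_t b = 0;
      for (size_t k = 0; i + k < len; ++k) {
        b |= static_cast<uint8_t>(bytemap[i + k] << k);
      }
      dst[i / 8] = b;
    }
  }
}
```

Keeping the comparison and the packing in separate loops is the point of the design: the comparison loop stays simple enough for the compiler to handle, and the packing code is shared by all comparison types.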



