Joe McDonnell created IMPALA-14753:
--------------------------------------

             Summary: SIMD optimizations for string search
                 Key: IMPALA-14753
                 URL: https://issues.apache.org/jira/browse/IMPALA-14753
             Project: IMPALA
          Issue Type: Task
          Components: Backend
    Affects Versions: Impala 5.0.0
            Reporter: Joe McDonnell


In be/src/benchmarks/string-search-benchmark.cc, it compares our current 
implementation in be/src/runtime/string-search.h (which came from Python) to 
one that calls out to strstr in libc. Running this today, I get these results:
{noformat}
$ bin/run-jvm-binary.sh taskset -c 0 setarch `uname -m` -R 
be/build/latest/benchmarks/string-search-benchmark 
Machine Info: 12th Gen Intel(R) Core(TM) i9-12900K
String Search:             Function  iters/ms   10%ile   50%ile   90%ile     
10%ile     50%ile     90%ile
                                                                         
(relative) (relative) (relative)
---------------------------------------------------------------------------------------------------------
                             Python                165      168      169        
 1X         1X         1X
                               LibC           1.37e+03 1.39e+03  1.4e+03      
8.31X      8.26X      8.25X
                Null Terminated SSE                924      932      937      
5.59X      5.56X      5.54X
            Non-null Terminated SSE                764      774      778      
4.63X      4.61X      4.59X{noformat}
This suggests that string search would benefit from SIMD optimizations. 
Further, there are libraries that claim to be faster than strstr (e.g. 
[https://github.com/ashvardanian/StringZilla] ).

Our current code is based on Cython code in faststring.h: 
[https://github.com/python/cpython/blob/main/Objects/stringlib/fastsearch.h]

The Cython code has since added an implementation of Crochemore and Perrin's 
(1991) Two-Way algorithm. This can avoid certain pathological cases.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to