Joe McDonnell created IMPALA-14753:
--------------------------------------
Summary: SIMD optimizations for string search
Key: IMPALA-14753
URL: https://issues.apache.org/jira/browse/IMPALA-14753
Project: IMPALA
Issue Type: Task
Components: Backend
Affects Versions: Impala 5.0.0
Reporter: Joe McDonnell
In be/src/benchmarks/string-search-benchmark.cc, it compares our current
implementation in be/src/runtime/string-search.h (which came from Python) to
one that calls out to strstr in libc. Running this today, I get these results:
{noformat}
$ bin/run-jvm-binary.sh taskset -c 0 setarch `uname -m` -R
be/build/latest/benchmarks/string-search-benchmark
Machine Info: 12th Gen Intel(R) Core(TM) i9-12900K
String Search: Function iters/ms 10%ile 50%ile 90%ile
10%ile 50%ile 90%ile
(relative) (relative) (relative)
---------------------------------------------------------------------------------------------------------
Python 165 168 169
1X 1X 1X
LibC 1.37e+03 1.39e+03 1.4e+03
8.31X 8.26X 8.25X
Null Terminated SSE 924 932 937
5.59X 5.56X 5.54X
Non-null Terminated SSE 764 774 778
4.63X 4.61X 4.59X{noformat}
This suggests that string search would benefit from SIMD optimizations.
Further, there are libraries that claim to be faster than strstr (e.g.
[https://github.com/ashvardanian/StringZilla] ).
Our current code is based on Cython code in faststring.h:
[https://github.com/python/cpython/blob/main/Objects/stringlib/fastsearch.h]
The Cython code has since added an implementation of Crochemore and Perrin's
(1991) Two-Way algorithm. This can avoid certain pathological cases.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]