[GitHub] [arrow-rs] alamb opened a new issue #71: [Rust][DataFusion] Improve like/nlike performance

GitBox Mon, 26 Apr 2021 05:31:10 -0700


alamb opened a new issue #71:
URL: https://github.com/apache/arrow-rs/issues/71

*Note*: migrated from original JIRA:
https://issues.apache.org/jira/browse/ARROW-8681

Currently, the implementation of `like_utf8` and `nlike_utf8` is based on
regex, which is simple and readable, but poor at the performance.

I do some benchmark in [https://github.com/TennyZhuang/like-bench/] , in
this repo, I compare three like algorithm.
# `like`(includes partial_like): this is the first naive version, using the
recursive approach, which will cause terrible performance on special attack
input, such as `a%a%a%a%a%a%a%a%b`.
# `like_to_regex`: which is almost the same as the current implementation
in arrow.
# `like_optimize`: the like problem is similar to glob in shell, so a
perfect solution is proposed in [https://research.swtch.com/glob] . The code in
the research is written golang but I translate it to rust.

In my benchmark result, the recursive solution can be ignored due to bad
time complexity lower bound.

the regex solution will cost about 1000x time including regex compiling, and
about 4x time without regex compiling then solution 3. And It seems that the
code complexity of solution 3 is acceptable.

Everyone can reproduce the benchmark result using this repo with a few codes.

I have submitted a PR to TiKV to optimize the like performance
([https://github.com/tikv/tikv/pull/5866/files|https://github.com/tikv/tikv/pull/5866/files)],
without UTF-8 support), and add collation support in
[https://github.com/tikv/tikv/pull/6592], which can be simply port to
data-fusion.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [arrow-rs] alamb opened a new issue #71: [Rust][DataFusion] Improve like/nlike performance

Reply via email to