[
https://issues.apache.org/jira/browse/ARROW-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wes McKinney updated ARROW-8681:
--------------------------------
Summary: [Rust][DataFusion] Improve like/nlike performance (was: Improve
like/nlike performance)
> [Rust][DataFusion] Improve like/nlike performance
> -------------------------------------------------
>
> Key: ARROW-8681
> URL: https://issues.apache.org/jira/browse/ARROW-8681
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust, Rust - DataFusion
> Affects Versions: 0.17.0
> Reporter: Zhuang Tianyi
> Priority: Minor
> Labels: performance
>
> Currently, the implementation of `like_utf8` and `nlike_utf8` is based on
> regex, which is simple and readable, but poor at the performance.
>
> I do some benchmark in [https://github.com/TennyZhuang/like-bench/] , in this
> repo, I compare three like algorithm.
> # `like`(includes partial_like): this is the first naive version, using the
> recursive approach, which will cause terrible performance on special attack
> input, such as `a%a%a%a%a%a%a%a%b`.
> # `like_to_regex`: which is almost the same as the current implementation in
> arrow.
> # `like_optimize`: the like problem is similar to glob in shell, so a
> perfect solution is proposed in [https://research.swtch.com/glob] . The code
> in the research is written golang but I translate it to rust.
>
> In my benchmark result, the recursive solution can be ignored due to bad time
> complexity lower bound.
> the regex solution will cost about 1000x time including regex compiling, and
> about 4x time without regex compiling then solution 3. And It seems that the
> code complexity of solution 3 is acceptable.
> Everyone can reproduce the benchmark result using this repo with a few codes.
>
> I have submitted a PR to TiKV to optimize the like performance
> ([https://github.com/tikv/tikv/pull/5866/files|https://github.com/tikv/tikv/pull/5866/files)],
> without UTF-8 support), and add collation support in
> [https://github.com/tikv/tikv/pull/6592], which can be simply port to
> data-fusion.
>
>
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)