[jira] [Updated] (ARROW-8681) Improve like/nlike performance

Zhuang Tianyi (Jira) Sun, 03 May 2020 14:08:23 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-8681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zhuang Tianyi updated ARROW-8681:
---------------------------------
    Description: 
Currently, the implementation of  `like_utf8` and `nlike_utf8` is based on 
regex, which is simple and readable, but poor at the performance.

 

I do some benchmark in [https://github.com/TennyZhuang/like-bench/] , in this 
repo, I compare three like algorithm.
 # `like`(includes partial_like): this is the first naive version, using the 
recursive approach, which will cause terrible performance on special attack 
input, such as `a%a%a%a%a%a%a%a%b`.
 # `like_to_regex`: which is almost the same as the current implementation in 
arrow.
 # `like_optimize`: the like problem is similar to glob in shell, so a perfect 
solution is proposed in [https://research.swtch.com/glob] . The code in the 
research is written golang but I translate it to rust.

 

In my benchmark result, the recursive solution can be ignored due to bad time 
complexity lower bound.

the regex solution will cost about 1000x time including regex compiling, and 
about 4x time without regex compiling then solution 3. And It seems that the 
code complexity of solution 3 is acceptable.

Everyone can reproduce the benchmark result using this repo with a few codes.

 

I have submitted a PR to TiKV to optimize the like performance 
([https://github.com/tikv/tikv/pull/5866/files|https://github.com/tikv/tikv/pull/5866/files)],
 without UTF-8 support), and add collation support in 
[https://github.com/tikv/tikv/pull/6592], which can be simply port to 
data-fusion.

 

 

 

 

  was:
Currently, the implementation of  `like_utf8` and `nlike_utf8` is based on 
regex, which is simple and readable, but poor at the performance.

 

I do some benchmark in [https://github.com/TennyZhuang/like-bench/] , in this 
repo, I compare three like algorithm.
 # `like`(includes partial_like): this is the first naive version, using the 
recursive approach, which will cause terrible performance on special attack 
input, such as `a%a%a%a%a%a%a%a%b`.
 # `like_to_regex`: which is almost the same as the current implementation in 
arrow.
 # `like_optimize`: the like problem is similar to glob in shell, so a perfect 
solution is proposed in [https://research.swtch.com/glob] . The code in the 
research is written golang but I translate it to rust.

 

In my benchmark result, the recursive solution can be ignored due to bad time 
complexity lower bound.

the regex solution will cost about 1000x time including regex compiling, and 
about 4x time without regex compiling then solution 3. And It seems that the 
code complexity is acceptable.

Everyone can reproduce the benchmark result using this repo with a few code.

 

I have submitted a PR to TiKV to optimize the like performance 
([https://github.com/tikv/tikv/pull/5866/files|https://github.com/tikv/tikv/pull/5866/files)],
 without UTF-8 support), and add collation support in 
[https://github.com/tikv/tikv/pull/6592], which can be simply port to 
data-fusion.

 

 

 

 


> Improve like/nlike performance
> ------------------------------
>
>                 Key: ARROW-8681
>                 URL: https://issues.apache.org/jira/browse/ARROW-8681
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust, Rust - DataFusion
>    Affects Versions: 0.17.0
>            Reporter: Zhuang Tianyi
>            Priority: Minor
>              Labels: performance
>
> Currently, the implementation of  `like_utf8` and `nlike_utf8` is based on 
> regex, which is simple and readable, but poor at the performance.
>  
> I do some benchmark in [https://github.com/TennyZhuang/like-bench/] , in this 
> repo, I compare three like algorithm.
>  # `like`(includes partial_like): this is the first naive version, using the 
> recursive approach, which will cause terrible performance on special attack 
> input, such as `a%a%a%a%a%a%a%a%b`.
>  # `like_to_regex`: which is almost the same as the current implementation in 
> arrow.
>  # `like_optimize`: the like problem is similar to glob in shell, so a 
> perfect solution is proposed in [https://research.swtch.com/glob] . The code 
> in the research is written golang but I translate it to rust.
>  
> In my benchmark result, the recursive solution can be ignored due to bad time 
> complexity lower bound.
> the regex solution will cost about 1000x time including regex compiling, and 
> about 4x time without regex compiling then solution 3. And It seems that the 
> code complexity of solution 3 is acceptable.
> Everyone can reproduce the benchmark result using this repo with a few codes.
>  
> I have submitted a PR to TiKV to optimize the like performance 
> ([https://github.com/tikv/tikv/pull/5866/files|https://github.com/tikv/tikv/pull/5866/files)],
>  without UTF-8 support), and add collation support in 
> [https://github.com/tikv/tikv/pull/6592], which can be simply port to 
> data-fusion.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Updated] (ARROW-8681) Improve like/nlike performance

Reply via email to