yangzhg edited a comment on issue #1929: Think about replacing RE with hyperscan
URL: 
https://github.com/apache/incubator-doris/issues/1929#issuecomment-544862728
 
 
   Hyperscan performs better than RE2 overall in my test (320.6 ms vs. 1087.9 ms total):
   ```
   Regex: 'Twain'
   [      pcre] time:     4.0 ms (+/-  3.7 %), matches:      811
   [  pcre-dfa] time:    12.2 ms (+/-  0.2 %), matches:      811
   [  pcre-jit] time:    18.8 ms (+/-  0.5 %), matches:      811
   [       re2] time:     2.9 ms (+/-  5.5 %), matches:      811
   [      onig] time:    20.7 ms (+/-  1.0 %), matches:      811
   [       tre] time:   268.7 ms (+/-  0.2 %), matches:      811
   [     hscan] time:     1.9 ms (+/- 21.5 %), matches:      811
   [rust_regex] time:     2.4 ms (+/-  3.4 %), matches:      811
   -----------------
   Regex: '(?i)Twain'
   [      pcre] time:    64.8 ms (+/-  1.3 %), matches:      965
   [  pcre-dfa] time:    90.9 ms (+/-  0.2 %), matches:      965
   [  pcre-jit] time:    19.7 ms (+/-  2.4 %), matches:      965
   [       re2] time:    60.0 ms (+/-  1.1 %), matches:      965
   [      onig] time:    41.5 ms (+/-  1.0 %), matches:      965
   [       tre] time:   361.1 ms (+/-  0.4 %), matches:      965
   [     hscan] time:     2.0 ms (+/- 21.2 %), matches:      965
   [rust_regex] time:    22.8 ms (+/-  0.2 %), matches:      965
   -----------------
   Regex: '[a-z]shing'
   [      pcre] time:   453.9 ms (+/-  0.1 %), matches:     1540
   [  pcre-dfa] time:   725.1 ms (+/-  0.2 %), matches:     1540
   [  pcre-jit] time:    17.9 ms (+/-  0.8 %), matches:     1540
   [       re2] time:   102.1 ms (+/-  0.6 %), matches:     1540
   [      onig] time:    17.8 ms (+/-  0.9 %), matches:     1540
   [       tre] time:   397.5 ms (+/-  0.4 %), matches:     1540
   [     hscan] time:     4.6 ms (+/-  3.9 %), matches:     1540
   [rust_regex] time:     7.1 ms (+/-  3.0 %), matches:     1540
   -----------------
   Regex: 'Huck[a-zA-Z]+|Saw[a-zA-Z]+'
   [      pcre] time:    21.7 ms (+/-  0.2 %), matches:      262
   [  pcre-dfa] time:    23.1 ms (+/-  0.4 %), matches:      262
   [  pcre-jit] time:     3.1 ms (+/-  2.3 %), matches:      262
   [       re2] time:    39.7 ms (+/-  1.2 %), matches:      262
   [      onig] time:    45.1 ms (+/-  2.0 %), matches:      262
   [       tre] time:   476.1 ms (+/-  0.3 %), matches:      262
   [     hscan] time:     2.8 ms (+/- 13.8 %), matches:      977
   [rust_regex] time:     3.0 ms (+/-  1.5 %), matches:      262
   -----------------
   Regex: '\b\w+nn\b'
   [      pcre] time:   675.6 ms (+/-  0.4 %), matches:      262
   [  pcre-dfa] time:  1036.5 ms (+/-  0.5 %), matches:      262
   [  pcre-jit] time:   103.7 ms (+/-  0.7 %), matches:      262
   [       re2] time:    42.9 ms (+/-  1.2 %), matches:      262
   [      onig] time:   731.9 ms (+/-  0.7 %), matches:      262
   [       tre] time:   732.3 ms (+/-  0.7 %), matches:      262
   [     hscan] time:   131.1 ms (+/-  0.4 %), matches:      262
   [rust_regex] time:   215.8 ms (+/-  0.4 %), matches:      262
   -----------------
   Regex: '[a-q][^u-z]{13}x'
   [      pcre] time:   555.4 ms (+/-  0.8 %), matches:     4094
   [  pcre-dfa] time:  1880.2 ms (+/-  0.2 %), matches:     4094
   [  pcre-jit] time:     2.5 ms (+/- 30.5 %), matches:     4094
   [       re2] time:   185.1 ms (+/-  9.2 %), matches:     4094
   [      onig] time:    44.2 ms (+/-  0.1 %), matches:     4094
   [       tre] time:  1066.0 ms (+/-  0.5 %), matches:     4094
   [     hscan] time:    87.1 ms (+/-  0.9 %), matches:     4094
   [rust_regex] time:  3352.4 ms (+/-  1.4 %), matches:     4094
   -----------------
   Regex: 'Tom|Sawyer|Huckleberry|Finn'
   [      pcre] time:    30.1 ms (+/-  4.0 %), matches:     2598
   [  pcre-dfa] time:    32.7 ms (+/-  4.6 %), matches:     2598
   [  pcre-jit] time:    26.3 ms (+/-  0.4 %), matches:     2598
   [       re2] time:    42.1 ms (+/-  2.2 %), matches:     2598
   [      onig] time:    52.6 ms (+/-  5.7 %), matches:     2598
   [       tre] time:   886.0 ms (+/-  0.8 %), matches:     2598
   [     hscan] time:     3.3 ms (+/-  7.3 %), matches:     2598
   [rust_regex] time:    47.6 ms (+/-  0.3 %), matches:     2598
   -----------------
   Regex: '(?i)Tom|Sawyer|Huckleberry|Finn'
   [      pcre] time:   353.8 ms (+/-  1.2 %), matches:     4152
   [  pcre-dfa] time:   356.5 ms (+/-  0.1 %), matches:     4152
   [  pcre-jit] time:    82.2 ms (+/-  1.0 %), matches:     4152
   [       re2] time:    90.4 ms (+/-  0.7 %), matches:     4152
   [      onig] time:   354.0 ms (+/-  0.6 %), matches:     4152
   [       tre] time:  1278.5 ms (+/-  1.3 %), matches:     4152
   [     hscan] time:     3.4 ms (+/- 14.1 %), matches:     4152
   [rust_regex] time:    48.7 ms (+/-  0.8 %), matches:     4152
   -----------------
   Regex: '.{0,2}(Tom|Sawyer|Huckleberry|Finn)'
   [      pcre] time:  4643.9 ms (+/-  0.3 %), matches:     2598
   [  pcre-dfa] time:  3536.5 ms (+/-  0.1 %), matches:     2598
   [  pcre-jit] time:   319.5 ms (+/-  0.4 %), matches:     2598
   [       re2] time:    49.0 ms (+/-  0.7 %), matches:     2598
   [      onig] time:    86.6 ms (+/-  1.7 %), matches:     2598
   [       tre] time:  2239.2 ms (+/-  0.1 %), matches:     2598
   [     hscan] time:     3.3 ms (+/- 10.8 %), matches:     2598
   [rust_regex] time:    48.1 ms (+/-  0.7 %), matches:     2598
   -----------------
   Regex: '.{2,4}(Tom|Sawyer|Huckleberry|Finn)'
   [      pcre] time:  4794.3 ms (+/-  0.6 %), matches:     1976
   [  pcre-dfa] time:  4198.2 ms (+/-  0.1 %), matches:     1976
   [  pcre-jit] time:   359.7 ms (+/-  0.3 %), matches:     1976
   [       re2] time:    49.0 ms (+/-  1.2 %), matches:     1976
   [      onig] time:    88.9 ms (+/-  2.1 %), matches:     1976
   [       tre] time:  3267.7 ms (+/-  0.5 %), matches:     1976
   [     hscan] time:     4.1 ms (+/- 18.7 %), matches:     2598
   [rust_regex] time:    48.4 ms (+/-  2.0 %), matches:     1976
   -----------------
   Regex: 'Tom.{10,25}river|river.{10,25}Tom'
   [      pcre] time:    65.9 ms (+/-  1.0 %), matches:        2
   [  pcre-dfa] time:    83.6 ms (+/-  0.5 %), matches:        2
   [  pcre-jit] time:    16.6 ms (+/-  0.3 %), matches:        2
   [       re2] time:    52.9 ms (+/-  0.7 %), matches:        2
   [      onig] time:    83.6 ms (+/-  0.3 %), matches:        2
   [       tre] time:   481.6 ms (+/-  0.5 %), matches:        2
   [     hscan] time:     3.1 ms (+/- 19.4 %), matches:        4
   [rust_regex] time:    18.7 ms (+/-  8.4 %), matches:        2
   -----------------
   Regex: '[a-zA-Z]+ing'
   [      pcre] time:  1068.3 ms (+/-  2.7 %), matches:    78424
   [  pcre-dfa] time:  1595.4 ms (+/-  0.1 %), matches:    78424
   [  pcre-jit] time:    77.6 ms (+/-  0.5 %), matches:    78424
   [       re2] time:   118.8 ms (+/-  0.1 %), matches:    78424
   [      onig] time:   656.7 ms (+/-  0.7 %), matches:    78424
   [       tre] time:   506.6 ms (+/-  0.3 %), matches:    78424
   [     hscan] time:    19.1 ms (+/-  0.7 %), matches:    78872
   [rust_regex] time:    19.0 ms (+/-  4.7 %), matches:    78424
   -----------------
   Regex: '\s[a-zA-Z]{0,12}ing\s'
   [      pcre] time:   450.5 ms (+/-  1.0 %), matches:    55248
   [  pcre-dfa] time:   625.3 ms (+/-  0.2 %), matches:    55248
   [  pcre-jit] time:   111.6 ms (+/-  0.5 %), matches:    55248
   [       re2] time:    71.8 ms (+/-  0.3 %), matches:    55248
   [      onig] time:    77.0 ms (+/-  0.2 %), matches:    55248
   [       tre] time:   747.6 ms (+/-  0.7 %), matches:    55248
   [     hscan] time:    26.8 ms (+/-  1.6 %), matches:    55640
   [rust_regex] time:    55.8 ms (+/-  0.6 %), matches:    55248
   -----------------
   Regex: '([A-Za-z]awyer|[A-Za-z]inn)\s'
   [      pcre] time:   998.7 ms (+/-  0.8 %), matches:      209
   [  pcre-dfa] time:  1075.9 ms (+/-  0.6 %), matches:      209
   [  pcre-jit] time:    40.2 ms (+/-  0.4 %), matches:      209
   [       re2] time:    97.6 ms (+/-  0.6 %), matches:      209
   [      onig] time:   180.7 ms (+/-  0.5 %), matches:      209
   [       tre] time:   935.2 ms (+/-  1.0 %), matches:      209
   [     hscan] time:     5.4 ms (+/-  3.4 %), matches:      209
   [rust_regex] time:    47.7 ms (+/-  0.2 %), matches:      209
   -----------------
   Regex: '["'][^"']{0,30}[?!\.]["']'
   [      pcre] time:    57.0 ms (+/-  0.8 %), matches:     8886
   [  pcre-dfa] time:    82.8 ms (+/-  0.9 %), matches:     8886
   [  pcre-jit] time:    12.8 ms (+/-  1.0 %), matches:     8886
   [       re2] time:    44.4 ms (+/-  0.6 %), matches:     8886
   [      onig] time:    85.7 ms (+/-  0.3 %), matches:     8886
   [       tre] time:   525.9 ms (+/-  0.6 %), matches:     8886
   [     hscan] time:    17.5 ms (+/-  3.6 %), matches:     8898
   [rust_regex] time:    12.4 ms (+/-  5.2 %), matches:     8886
   -----------------
   Regex: '∞|✓'
   [      pcre] time:     1.9 ms (+/-  6.1 %), matches:        2
   [  pcre-dfa] time:     9.2 ms (+/-  0.7 %), matches:        2
   [  pcre-jit] time:     2.6 ms (+/-  8.7 %), matches:        2
   [       re2] time:     2.5 ms (+/-  6.4 %), matches:        2
   [      onig] time:    41.2 ms (+/-  0.1 %), matches:        2
   [       tre] time:   367.6 ms (+/-  0.2 %), matches:        2
   [     hscan] time:     2.9 ms (+/- 15.0 %), matches:        2
   [rust_regex] time:    47.5 ms (+/-  0.1 %), matches:        2
   -----------------
   Regex: '\p{Sm}'
   [      pcre] time:   460.7 ms (+/-  0.1 %), matches:       68
   [  pcre-dfa] time:   692.3 ms (+/-  0.2 %), matches:       68
   [  pcre-jit] time:    56.8 ms (+/-  0.7 %), matches:       68
   [       re2] time:    36.7 ms (+/-  0.2 %), matches:       68
   Onig compilation failed
   TRE compilation failed with error 10
   [     hscan] time:     2.2 ms (+/-  5.0 %), matches:       68
   [rust_regex] time:    47.7 ms (+/-  0.1 %), matches:       69
   -----------------
   Total Results:
   [      pcre] time:  14700.5 ms, score:      8 points,
   [  pcre-dfa] time:  16056.4 ms, score:      0 points,
   [  pcre-jit] time:   1271.6 ms, score:     41 points,
   [       re2] time:   1087.9 ms, score:     25 points,
   [      onig] time:   2608.1 ms, score:      7 points,
   [       tre] time:  14537.4 ms, score:      0 points,
   [     hscan] time:    320.6 ms, score:     73 points,
   [rust_regex] time:   4045.1 ms, score:     50 points,
   ```
   However, Hyperscan does not support capturing groups, according to this issue:
https://github.com/intel/hyperscan/issues/64
   
![image](https://user-images.githubusercontent.com/9098473/67270306-fc10e100-f4ea-11e9-9276-176d202b4b8c.png)
   
   
   In our project, most calls to RE2::FullMatch use capture groups to extract
sub-match data from the input string, so we cannot replace RE2 with Hyperscan.
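
   To illustrate the kind of sub-match extraction meant here, below is a minimal sketch. It is not code from the project: it uses `std::regex` as a stand-in for RE2 so that it has no external dependency, and the helper name `parse_key_value` and the `key=value` pattern are hypothetical. With RE2, the equivalent call would look roughly like `RE2::FullMatch(input, "(\\w+)=(\\d+)", &key, &value)`. Hyperscan's match callback reports only the pattern id and match offsets, so the captured pieces would not be available from it.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (illustration only): fully match "key=value" and copy
// the two captured sub-matches into the output parameters, similar in spirit
// to how RE2::FullMatch fills its trailing arguments.
bool parse_key_value(const std::string& input, std::string* key, int* value) {
    // Two capture groups: a word for the key, up to 9 digits for the value
    // (the digit limit keeps std::stoi from overflowing).
    static const std::regex pattern(R"((\w+)=(\d{1,9}))");

    std::smatch groups;
    // std::regex_match succeeds only if the whole input matches the pattern.
    if (!std::regex_match(input, groups, pattern)) {
        return false;
    }

    // groups[0] is the whole match; groups[1] and groups[2] are the sub-matches.
    *key = groups[1].str();
    *value = std::stoi(groups[2].str());
    return true;
}
```

   Calling `parse_key_value("port=8080", &key, &value)` returns true and fills `key` and `value` from the two groups. An engine that only reports where a match occurred cannot provide that second step.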
