yangzhg edited a comment on issue #1929: Think about replacing RE with hyperscan
URL: https://github.com/apache/incubator-doris/issues/1929#issuecomment-544862728

Based on my tests, hyperscan has better overall performance than re2 (320.6 ms vs. 1087.9 ms total):

```
Regex: 'Twain'
[      pcre] time: 4.0 ms (+/- 3.7 %), matches: 811
[  pcre-dfa] time: 12.2 ms (+/- 0.2 %), matches: 811
[  pcre-jit] time: 18.8 ms (+/- 0.5 %), matches: 811
[       re2] time: 2.9 ms (+/- 5.5 %), matches: 811
[      onig] time: 20.7 ms (+/- 1.0 %), matches: 811
[       tre] time: 268.7 ms (+/- 0.2 %), matches: 811
[     hscan] time: 1.9 ms (+/- 21.5 %), matches: 811
[rust_regex] time: 2.4 ms (+/- 3.4 %), matches: 811
-----------------
Regex: '(?i)Twain'
[      pcre] time: 64.8 ms (+/- 1.3 %), matches: 965
[  pcre-dfa] time: 90.9 ms (+/- 0.2 %), matches: 965
[  pcre-jit] time: 19.7 ms (+/- 2.4 %), matches: 965
[       re2] time: 60.0 ms (+/- 1.1 %), matches: 965
[      onig] time: 41.5 ms (+/- 1.0 %), matches: 965
[       tre] time: 361.1 ms (+/- 0.4 %), matches: 965
[     hscan] time: 2.0 ms (+/- 21.2 %), matches: 965
[rust_regex] time: 22.8 ms (+/- 0.2 %), matches: 965
-----------------
Regex: '[a-z]shing'
[      pcre] time: 453.9 ms (+/- 0.1 %), matches: 1540
[  pcre-dfa] time: 725.1 ms (+/- 0.2 %), matches: 1540
[  pcre-jit] time: 17.9 ms (+/- 0.8 %), matches: 1540
[       re2] time: 102.1 ms (+/- 0.6 %), matches: 1540
[      onig] time: 17.8 ms (+/- 0.9 %), matches: 1540
[       tre] time: 397.5 ms (+/- 0.4 %), matches: 1540
[     hscan] time: 4.6 ms (+/- 3.9 %), matches: 1540
[rust_regex] time: 7.1 ms (+/- 3.0 %), matches: 1540
-----------------
Regex: 'Huck[a-zA-Z]+|Saw[a-zA-Z]+'
[      pcre] time: 21.7 ms (+/- 0.2 %), matches: 262
[  pcre-dfa] time: 23.1 ms (+/- 0.4 %), matches: 262
[  pcre-jit] time: 3.1 ms (+/- 2.3 %), matches: 262
[       re2] time: 39.7 ms (+/- 1.2 %), matches: 262
[      onig] time: 45.1 ms (+/- 2.0 %), matches: 262
[       tre] time: 476.1 ms (+/- 0.3 %), matches: 262
[     hscan] time: 2.8 ms (+/- 13.8 %), matches: 977
[rust_regex] time: 3.0 ms (+/- 1.5 %), matches: 262
-----------------
Regex: '\b\w+nn\b'
[      pcre] time: 675.6 ms (+/- 0.4 %), matches: 262
[  pcre-dfa] time: 1036.5 ms (+/- 0.5 %), matches: 262
[  pcre-jit] time: 103.7 ms (+/- 0.7 %), matches: 262
[       re2] time: 42.9 ms (+/- 1.2 %), matches: 262
[      onig] time: 731.9 ms (+/- 0.7 %), matches: 262
[       tre] time: 732.3 ms (+/- 0.7 %), matches: 262
[     hscan] time: 131.1 ms (+/- 0.4 %), matches: 262
[rust_regex] time: 215.8 ms (+/- 0.4 %), matches: 262
-----------------
Regex: '[a-q][^u-z]{13}x'
[      pcre] time: 555.4 ms (+/- 0.8 %), matches: 4094
[  pcre-dfa] time: 1880.2 ms (+/- 0.2 %), matches: 4094
[  pcre-jit] time: 2.5 ms (+/- 30.5 %), matches: 4094
[       re2] time: 185.1 ms (+/- 9.2 %), matches: 4094
[      onig] time: 44.2 ms (+/- 0.1 %), matches: 4094
[       tre] time: 1066.0 ms (+/- 0.5 %), matches: 4094
[     hscan] time: 87.1 ms (+/- 0.9 %), matches: 4094
[rust_regex] time: 3352.4 ms (+/- 1.4 %), matches: 4094
-----------------
Regex: 'Tom|Sawyer|Huckleberry|Finn'
[      pcre] time: 30.1 ms (+/- 4.0 %), matches: 2598
[  pcre-dfa] time: 32.7 ms (+/- 4.6 %), matches: 2598
[  pcre-jit] time: 26.3 ms (+/- 0.4 %), matches: 2598
[       re2] time: 42.1 ms (+/- 2.2 %), matches: 2598
[      onig] time: 52.6 ms (+/- 5.7 %), matches: 2598
[       tre] time: 886.0 ms (+/- 0.8 %), matches: 2598
[     hscan] time: 3.3 ms (+/- 7.3 %), matches: 2598
[rust_regex] time: 47.6 ms (+/- 0.3 %), matches: 2598
-----------------
Regex: '(?i)Tom|Sawyer|Huckleberry|Finn'
[      pcre] time: 353.8 ms (+/- 1.2 %), matches: 4152
[  pcre-dfa] time: 356.5 ms (+/- 0.1 %), matches: 4152
[  pcre-jit] time: 82.2 ms (+/- 1.0 %), matches: 4152
[       re2] time: 90.4 ms (+/- 0.7 %), matches: 4152
[      onig] time: 354.0 ms (+/- 0.6 %), matches: 4152
[       tre] time: 1278.5 ms (+/- 1.3 %), matches: 4152
[     hscan] time: 3.4 ms (+/- 14.1 %), matches: 4152
[rust_regex] time: 48.7 ms (+/- 0.8 %), matches: 4152
-----------------
Regex: '.{0,2}(Tom|Sawyer|Huckleberry|Finn)'
[      pcre] time: 4643.9 ms (+/- 0.3 %), matches: 2598
[  pcre-dfa] time: 3536.5 ms (+/- 0.1 %), matches: 2598
[  pcre-jit] time: 319.5 ms (+/- 0.4 %), matches: 2598
[       re2] time: 49.0 ms (+/- 0.7 %), matches: 2598
[      onig] time: 86.6 ms (+/- 1.7 %), matches: 2598
[       tre] time: 2239.2 ms (+/- 0.1 %), matches: 2598
[     hscan] time: 3.3 ms (+/- 10.8 %), matches: 2598
[rust_regex] time: 48.1 ms (+/- 0.7 %), matches: 2598
-----------------
Regex: '.{2,4}(Tom|Sawyer|Huckleberry|Finn)'
[      pcre] time: 4794.3 ms (+/- 0.6 %), matches: 1976
[  pcre-dfa] time: 4198.2 ms (+/- 0.1 %), matches: 1976
[  pcre-jit] time: 359.7 ms (+/- 0.3 %), matches: 1976
[       re2] time: 49.0 ms (+/- 1.2 %), matches: 1976
[      onig] time: 88.9 ms (+/- 2.1 %), matches: 1976
[       tre] time: 3267.7 ms (+/- 0.5 %), matches: 1976
[     hscan] time: 4.1 ms (+/- 18.7 %), matches: 2598
[rust_regex] time: 48.4 ms (+/- 2.0 %), matches: 1976
-----------------
Regex: 'Tom.{10,25}river|river.{10,25}Tom'
[      pcre] time: 65.9 ms (+/- 1.0 %), matches: 2
[  pcre-dfa] time: 83.6 ms (+/- 0.5 %), matches: 2
[  pcre-jit] time: 16.6 ms (+/- 0.3 %), matches: 2
[       re2] time: 52.9 ms (+/- 0.7 %), matches: 2
[      onig] time: 83.6 ms (+/- 0.3 %), matches: 2
[       tre] time: 481.6 ms (+/- 0.5 %), matches: 2
[     hscan] time: 3.1 ms (+/- 19.4 %), matches: 4
[rust_regex] time: 18.7 ms (+/- 8.4 %), matches: 2
-----------------
Regex: '[a-zA-Z]+ing'
[      pcre] time: 1068.3 ms (+/- 2.7 %), matches: 78424
[  pcre-dfa] time: 1595.4 ms (+/- 0.1 %), matches: 78424
[  pcre-jit] time: 77.6 ms (+/- 0.5 %), matches: 78424
[       re2] time: 118.8 ms (+/- 0.1 %), matches: 78424
[      onig] time: 656.7 ms (+/- 0.7 %), matches: 78424
[       tre] time: 506.6 ms (+/- 0.3 %), matches: 78424
[     hscan] time: 19.1 ms (+/- 0.7 %), matches: 78872
[rust_regex] time: 19.0 ms (+/- 4.7 %), matches: 78424
-----------------
Regex: '\s[a-zA-Z]{0,12}ing\s'
[      pcre] time: 450.5 ms (+/- 1.0 %), matches: 55248
[  pcre-dfa] time: 625.3 ms (+/- 0.2 %), matches: 55248
[  pcre-jit] time: 111.6 ms (+/- 0.5 %), matches: 55248
[       re2] time: 71.8 ms (+/- 0.3 %), matches: 55248
[      onig] time: 77.0 ms (+/- 0.2 %), matches: 55248
[       tre] time: 747.6 ms (+/- 0.7 %), matches: 55248
[     hscan] time: 26.8 ms (+/- 1.6 %), matches: 55640
[rust_regex] time: 55.8 ms (+/- 0.6 %), matches: 55248
-----------------
Regex: '([A-Za-z]awyer|[A-Za-z]inn)\s'
[      pcre] time: 998.7 ms (+/- 0.8 %), matches: 209
[  pcre-dfa] time: 1075.9 ms (+/- 0.6 %), matches: 209
[  pcre-jit] time: 40.2 ms (+/- 0.4 %), matches: 209
[       re2] time: 97.6 ms (+/- 0.6 %), matches: 209
[      onig] time: 180.7 ms (+/- 0.5 %), matches: 209
[       tre] time: 935.2 ms (+/- 1.0 %), matches: 209
[     hscan] time: 5.4 ms (+/- 3.4 %), matches: 209
[rust_regex] time: 47.7 ms (+/- 0.2 %), matches: 209
-----------------
Regex: '["'][^"']{0,30}[?!\.]["']'
[      pcre] time: 57.0 ms (+/- 0.8 %), matches: 8886
[  pcre-dfa] time: 82.8 ms (+/- 0.9 %), matches: 8886
[  pcre-jit] time: 12.8 ms (+/- 1.0 %), matches: 8886
[       re2] time: 44.4 ms (+/- 0.6 %), matches: 8886
[      onig] time: 85.7 ms (+/- 0.3 %), matches: 8886
[       tre] time: 525.9 ms (+/- 0.6 %), matches: 8886
[     hscan] time: 17.5 ms (+/- 3.6 %), matches: 8898
[rust_regex] time: 12.4 ms (+/- 5.2 %), matches: 8886
-----------------
Regex: '∞|✓'
[      pcre] time: 1.9 ms (+/- 6.1 %), matches: 2
[  pcre-dfa] time: 9.2 ms (+/- 0.7 %), matches: 2
[  pcre-jit] time: 2.6 ms (+/- 8.7 %), matches: 2
[       re2] time: 2.5 ms (+/- 6.4 %), matches: 2
[      onig] time: 41.2 ms (+/- 0.1 %), matches: 2
[       tre] time: 367.6 ms (+/- 0.2 %), matches: 2
[     hscan] time: 2.9 ms (+/- 15.0 %), matches: 2
[rust_regex] time: 47.5 ms (+/- 0.1 %), matches: 2
-----------------
Regex: '\p{Sm}'
[      pcre] time: 460.7 ms (+/- 0.1 %), matches: 68
[  pcre-dfa] time: 692.3 ms (+/- 0.2 %), matches: 68
[  pcre-jit] time: 56.8 ms (+/- 0.7 %), matches: 68
[       re2] time: 36.7 ms (+/- 0.2 %), matches: 68
Onig compilation failed
TRE compilation failed with error 10
[     hscan] time: 2.2 ms (+/- 5.0 %), matches: 68
[rust_regex] time: 47.7 ms (+/- 0.1 %), matches: 69
-----------------
Total Results:
[      pcre] time: 14700.5 ms, score: 8 points,
[  pcre-dfa] time: 16056.4 ms, score: 0 points,
[  pcre-jit] time: 1271.6 ms, score: 41 points,
[       re2] time: 1087.9 ms, score: 25 points,
[      onig] time: 2608.1 ms, score: 7 points,
[       tre] time: 14537.4 ms, score: 0 points,
[     hscan] time: 320.6 ms, score: 73 points,
[rust_regex] time: 4045.1 ms, score: 50 points,
```

However, HyperScan does not support capturing groups, according to the issue https://github.com/intel/hyperscan/issues/64

In our project, RE2::FullMatch is mostly used to extract sub-match data from the matched string, so we cannot replace RE2 with HyperScan.
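
To make the capture-group point concrete, here is a minimal sketch of the difference between extracting sub-matches and only testing for a match. It uses `std::regex` as a stand-in so it has no third-party dependency; it is not the project's code. The helper names `parse_key_value` and `matches_key_value` and the `key=value` pattern are hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: full-match "key=value" and return both captured parts.
// std::regex_match requires the whole input to match, which is similar in
// spirit to RE2::FullMatch(input, "(\\w+)=(\\d+)", &key, &value).
std::optional<std::pair<std::string, int>> parse_key_value(const std::string& input) {
    static const std::regex pattern(R"((\w+)=(\d{1,9}))");
    assert(pattern.mark_count() == 2);  // two capturing groups: key and value

    std::smatch m;
    if (!std::regex_match(input, m, pattern)) {
        return std::nullopt;
    }
    // m[1] and m[2] are the sub-matches; this is the part that needs
    // capture-group support from the regex engine.
    return std::make_pair(m[1].str(), std::stoi(m[2].str()));
}

// Hypothetical helper: match-only check. An engine without capturing groups
// can answer this yes/no question, but cannot return the key or the value.
bool matches_key_value(const std::string& input) {
    static const std::regex pattern(R"(\w+=\d{1,9})");
    return std::regex_match(input, pattern);
}
```

For example, `parse_key_value("port=8080")` yields the pair `("port", 8080)`, while `matches_key_value("port=8080")` only reports `true`. Call sites of the first kind are the ones that would need rework if the engine were swapped for one without capture support.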
