twilliamson opened a new pull request, #16153:
URL: https://github.com/apache/druid/pull/16153

   ### Description
   
   `LikeDimFilter` was compiling the `LIKE` clause down to a 
`java.util.regex.Pattern`. Unfortunately, even seemingly simply regexes can 
lead to [catastrophic 
backtracking](https://www.regular-expressions.info/catastrophic.html). In 
particular, something as simple as a few `%` wildcards can end up in [exploding 
the time 
complexity](https://www.rexegg.com/regex-explosive-quantifiers.html#remote). 
This MR implements a simple greedy algorithm that avoids backtracking.
   
   Technically, the algorithm runs in `O(nm)`, where `n` is the length of the 
string to match and `m` is the length of the pattern. In practice, it should 
run in linear time: essentially as fast as `String.indexOf()` can search for 
the next match. Running an updated version of the `LikeFilterBenchmark` with 
Java 11 on a `t2.xlarge` instance showed at least a 1.7x speed up for a simple 
"contains" query (`%50%`), and more than a 20x speed up for a "killer" query 
with four wildcards but no matches (`%%%%x`). The benchmark uses short strings: 
cases with longer strings should benefit more.
   
   Note that the `REGEX` operator still suffers from the same 
potentially-catastrophic runtimes. Using a better library than the built-in 
`java.util.regex.Pattern` (e.g., [joni](https://github.com/jruby/joni)) would 
be a good idea to avoid accidental — or intentional — DoSing.
   
   ```
   Benchmark                                (cardinality)  Mode  Cnt  Before 
Score       Error  After Score       Error  Units  Before / After
   LikeFilterBenchmark.matchBoundPrefix              1000  avgt   10         
6.686 ±     0.026        6.765 ±     0.087  us/op           0.99x
   LikeFilterBenchmark.matchBoundPrefix            100000  avgt   10       
163.936 ±     1.589      140.014 ±     0.563  us/op           1.17x
   LikeFilterBenchmark.matchBoundPrefix           1000000  avgt   10      
1235.259 ±     7.318     1165.330 ±     9.300  us/op           1.06x
   LikeFilterBenchmark.matchLikeContains             1000  avgt   10       
255.074 ±     1.530      130.212 ±     3.314  us/op           1.96x
   LikeFilterBenchmark.matchLikeContains           100000  avgt   10     
34789.639 ±   210.219    18563.644 ±   100.030  us/op           1.87x
   LikeFilterBenchmark.matchLikeContains          1000000  avgt   10    
287265.302 ±  1790.957   164684.778 ±   317.698  us/op           1.74x
   LikeFilterBenchmark.matchLikeEquals               1000  avgt   10         
0.410 ±     0.003        0.399 ±     0.001  us/op           1.03x
   LikeFilterBenchmark.matchLikeEquals             100000  avgt   10         
0.793 ±     0.005        0.719 ±     0.003  us/op           1.10x
   LikeFilterBenchmark.matchLikeEquals            1000000  avgt   10         
0.864 ±     0.004        0.839 ±     0.005  us/op           1.03x
   LikeFilterBenchmark.matchLikeKiller               1000  avgt   10      
3077.629 ±     7.928      103.714 ±     2.417  us/op          29.67x
   LikeFilterBenchmark.matchLikeKiller             100000  avgt   10    
311048.049 ± 13466.911    14777.567 ±    70.242  us/op          21.05x
   LikeFilterBenchmark.matchLikeKiller            1000000  avgt   10   
3055855.099 ± 18387.839    92476.621 ±  1198.255  us/op          33.04x
   LikeFilterBenchmark.matchLikePrefix               1000  avgt   10         
6.711 ±     0.035        6.653 ±     0.046  us/op           1.01x
   LikeFilterBenchmark.matchLikePrefix             100000  avgt   10       
161.535 ±     0.574      163.740 ±     0.833  us/op           0.99x
   LikeFilterBenchmark.matchLikePrefix            1000000  avgt   10      
1255.696 ±     5.207     1201.378 ±     3.466  us/op           1.05x
   LikeFilterBenchmark.matchRegexContains            1000  avgt   10       
467.736 ±     2.546      481.431 ±     5.647  us/op           0.97x
   LikeFilterBenchmark.matchRegexContains          100000  avgt   10     
64871.766 ±   223.341    65483.992 ±   391.249  us/op           0.99x
   LikeFilterBenchmark.matchRegexContains         1000000  avgt   10    
482906.004 ±  2003.583   477195.835 ±  3094.605  us/op           1.01x
   LikeFilterBenchmark.matchRegexKiller              1000  avgt   10      
8071.881 ±    18.026     8052.322 ±    17.336  us/op           1.00x
   LikeFilterBenchmark.matchRegexKiller            100000  avgt   10   
1120094.520 ±  2428.172   808321.542 ±  2411.032  us/op           1.39x
   LikeFilterBenchmark.matchRegexKiller           1000000  avgt   10   
8096745.012 ± 40782.747  8114114.896 ± 43250.204  us/op           1.00x
   LikeFilterBenchmark.matchRegexPrefix              1000  avgt   10       
170.843 ±     1.095      175.924 ±     1.144  us/op           0.97x
   LikeFilterBenchmark.matchRegexPrefix            100000  avgt   10     
17785.280 ±   116.813    18708.888 ±    61.857  us/op           0.95x
   LikeFilterBenchmark.matchRegexPrefix           1000000  avgt   10    
174415.586 ±  1827.478   173190.799 ±   949.224  us/op           1.01x
   LikeFilterBenchmark.matchSelectorEquals           1000  avgt   10         
0.411 ±     0.003        0.416 ±     0.002  us/op           0.99x
   LikeFilterBenchmark.matchSelectorEquals         100000  avgt   10         
0.728 ±     0.003        0.739 ±     0.003  us/op           0.99x
   LikeFilterBenchmark.matchSelectorEquals        1000000  avgt   10         
0.842 ±     0.002        0.879 ±     0.007  us/op           0.96x
   ```
   
   #### Release note
   
   Improved: `LIKE`-operator optimizations when there are multiple wildcards, 
with usage of `java.util.regex.Pattern` replaced by a simple greedy algorithm.
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `LikeDimFilter` (the `LikeMatcher.from()` method is probably the 
most-complicated change)
    * `LikeDimFilter.LikePattern` (new class that replaces 
`java.util.regex.Pattern` for simple glob matching)
   
   <hr>
   
   This PR has:
   
   - [x] been self-reviewed.
   - [x] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [x] added comments explaining the "why" and the intent of the code 
wherever would not be obvious for an unfamiliar reader.
   - [x] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to