twilliamson commented on PR #16153:
URL: https://github.com/apache/druid/pull/16153#issuecomment-2057525485

   I got the custom implementation to work, but it's pretty complicated, and 
I'm bit worried about there being more bugs. Similarly, switching the regex 
library to `joni` or `re2` introduces a fairly big new dependency (though that 
probably needs to be done at some point for the `REGEXP_*` methods).
   
   Fortunately, there's a nice middle ground: use the list-of-patterns approach 
for handling the `%` matching (and thereby avoid catastrophic backtracking), 
but continue to use `java.util.regex` for handling the parts containing only 
`_` and literals (using `^` and `$` to handle prefix/suffix anchoring), e.g., 
`a%b_c%d` becomes `[^a, b.c, d$]`.
   
   This ends up boiling down to changing the `from()` parsing logic, then 
updating:
   ```
       private static DruidPredicateMatch matches(@Nullable final String s, 
Pattern pattern)
       {
         String val = NullHandling.nullToEmptyIfNeeded(s);
         if (val == null) {
           return DruidPredicateMatch.UNKNOWN;
         }
         return DruidPredicateMatch.of(pattern.matcher(val).matches());
       }
   ```
   to:
   ```
       private static DruidPredicateMatch matches(@Nullable final String s, 
List<Pattern> pattern)
       {
         String val = NullHandling.nullToEmptyIfNeeded(s);
         if (val == null) {
           return DruidPredicateMatch.UNKNOWN;
         }
   
         if (pattern.size() == 1) {
           // The common case is a single pattern: a% => ^a, %z => z$, %m% => m
           return DruidPredicateMatch.of(pattern.get(0).matcher(val).find());
         }
   
         int offset = 0;
   
         for (Pattern part : pattern) {
           Matcher matcher = part.matcher(val);
   
           if (!matcher.find(offset)) {
             return DruidPredicateMatch.FALSE;
           }
   
           offset = matcher.end();
         }
   
         return DruidPredicateMatch.TRUE;
       }
   ```
   
   The full benchmark suite is still running on a dedicated box, but local 
laptop testing confirms that the "killer regex" is no longer killer, and 
suggests performance is the same or better across the board (though not as much 
as with the fully-custom implementation, e.g., ~1.3x faster `contains`/`suffix` 
vs ~2x faster).
   
   The list-of-patterns approach seems like a sweet spot of bounded worst-case 
and slightly-better average performance, and a simple-enough implementation to 
avoid introducing bugs and not increase on-going maintenance overhead. I'm not 
sure whether it's better to update this PR with that approach (i.e., add 
another commit that rolls back the custom code), or create a new PR? And should 
I keep the new tests and benchmarks?
   
   @gianm What are your thoughts? How would you like me to proceed?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@druid.apache.org
For additional commands, e-mail: commits-h...@druid.apache.org

Reply via email to