itschrispeck opened a new issue, #12628:
URL: https://github.com/apache/pinot/issues/12628

   Currently Pinot uses the `java.util.regex` package. This generally performs 
well, but it does not gracefully handle patterns that cause catastrophic 
backtracking.
   
   For clusters in shared environments that take ad hoc queries, a poorly 
written regex can hold resources indefinitely. In the thread dump below, a query 
worker thread is still in `java.util.regex.Pattern$CharPropertyGreedy.match` 
hours after the problematic queries were executed:
   
   ```
   "pqw-54" #121060 [121613] prio=5 os_prio=0 cpu=7109969.11ms 
elapsed=188953.55s tid=0x00007fd16a303800 nid=121613 runnable  
[0x00007fe9c13fc000]   java.lang.Thread.State: RUNNABLE
        at 
java.util.regex.Pattern$CharPropertyGreedy.match([email protected]/Pattern.java:4470)
        at 
java.util.regex.Pattern$Start.match([email protected]/Pattern.java:3787)
        at java.util.regex.Matcher.search([email protected]/Matcher.java:1767)
        at java.util.regex.Matcher.find([email protected]/Matcher.java:787)
        at 
org.apache.pinot.core.operator.filter.predicate.RegexpLikePredicateEvaluatorFactory$RawValueBasedRegexpLikePredicateEvaluator.applySV(RegexpLikePredicateEvaluatorFactory.java:129)
   ```
   
   Google's RE2 library was created in part [to address 
this](https://github.com/google/re2/wiki/WhyRE2):
   > RE2 was designed and implemented with an explicit goal of being able to 
handle regular expressions from untrusted users without risk. One of its 
primary guarantees is that the match time is linear in the length of the input 
string. It was also written with production concerns in mind: the parser, the 
compiler and the execution engines limit their memory usage by working within a 
configurable budget – failing gracefully when exhausted – and they avoid stack 
overflow by eschewing recursion.
   
   re2 is used by other DBs such as ClickHouse. 
   
   If `re2j` seems to be the right approach, we could:
   1. Make a clean switch to [re2j](https://github.com/google/re2j) which 
carries some behavior/performance differences
   2. Add a config to allow users to choose their desired implementation
   3. Take an additional argument in `regexp_like` (I feel we should not leak 
the implementation like this)
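
Option 2 could be sketched roughly as below. This is only an illustration under stated assumptions: `RegexEngineSketch`, `RegexEngine`, and the `"JAVA_UTIL"` config value are hypothetical names, not existing Pinot classes or settings. The `RE2J` branch is left unimplemented so the sketch runs without extra dependencies; the comment in that branch describes what it would presumably call.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

class RegexEngineSketch {
    // Hypothetical engine names for illustration; not an existing Pinot setting.
    public enum RegexEngine { JAVA_UTIL, RE2J }

    // Compiles the regex once and returns a predicate that is true when the
    // input contains a match (the same find() semantics seen in the stack trace).
    public static Predicate<String> compile(String regex, RegexEngine engine) {
        switch (engine) {
            case JAVA_UTIL: {
                Pattern pattern = Pattern.compile(regex);
                return value -> pattern.matcher(value).find();
            }
            case RE2J:
                // Assumption: with re2j on the classpath this branch would use
                // com.google.re2j.Pattern.compile(regex) and matcher(value).find().
                throw new UnsupportedOperationException("re2j is not wired into this sketch");
            default:
                throw new IllegalArgumentException("Unknown engine: " + engine);
        }
    }

    public static void main(String[] args) {
        // Hypothetical config value; a real system would read it from configuration.
        RegexEngine engine = RegexEngine.valueOf("JAVA_UTIL");
        Predicate<String> matcher = compile("fo+", engine);
        System.out.println(matcher.test("xxfooxx")); // true
        System.out.println(matcher.test("bar"));     // false
    }
}
```

Keeping the choice behind a single compile step means `regexp_like` itself does not expose the implementation. One behavior difference to plan for: RE2 does not support backreferences or lookaround assertions, so some patterns that compile with `java.util.regex` would be rejected by `re2j`.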


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

