itschrispeck opened a new issue, #12628:
URL: https://github.com/apache/pinot/issues/12628
Currently Pinot uses the `java.util.regex` package. This generally performs
well, but it does not gracefully handle patterns that cause catastrophic
backtracking.
For clusters in shared environments that take ad hoc queries, a poorly
written regex can hold resources indefinitely. In the thread dump below, a query
worker thread is still in `java.util.regex.Pattern$CharPropertyGreedy.match`
hours after the problematic queries were executed:
```
"pqw-54" #121060 [121613] prio=5 os_prio=0 cpu=7109969.11ms elapsed=188953.55s tid=0x00007fd16a303800 nid=121613 runnable [0x00007fe9c13fc000]
   java.lang.Thread.State: RUNNABLE
    at java.util.regex.Pattern$CharPropertyGreedy.match([email protected]/Pattern.java:4470)
    at java.util.regex.Pattern$Start.match([email protected]/Pattern.java:3787)
    at java.util.regex.Matcher.search([email protected]/Matcher.java:1767)
    at java.util.regex.Matcher.find([email protected]/Matcher.java:787)
    at org.apache.pinot.core.operator.filter.predicate.RegexpLikePredicateEvaluatorFactory$RawValueBasedRegexpLikePredicateEvaluator.applySV(RegexpLikePredicateEvaluatorFactory.java:129)
```
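To make the failure mode concrete, here is a minimal sketch that counts how many characters `java.util.regex` reads during an unanchored `find()`. The pattern `.*x` and the all-`a` input are assumptions chosen for illustration (the actual problematic queries are not part of this report), and `CountingSequence` and `readsFor` are hypothetical helpers. On recent JDKs a leading `.*` should run through the same `Start.match` and `CharPropertyGreedy.match` frames as the dump above, and the work grows faster than the input length:

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class BacktrackingCostSketch {
    /** Hypothetical wrapper that counts every character the regex engine reads. */
    static final class CountingSequence implements CharSequence {
        private final CharSequence delegate;
        long reads = 0;

        CountingSequence(CharSequence delegate) {
            this.delegate = delegate;
        }

        @Override
        public char charAt(int index) {
            reads++;
            return delegate.charAt(index);
        }

        @Override
        public int length() {
            return delegate.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return delegate.subSequence(start, end);
        }

        @Override
        public String toString() {
            return delegate.toString();
        }
    }

    /** Counts charAt calls for find() with an assumed pattern on n non-matching characters. */
    static long readsFor(int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, 'a');
        // Assumed pattern: greedy ".*" followed by a literal that never occurs in the input.
        Pattern pattern = Pattern.compile(".*x");
        CountingSequence input = new CountingSequence(new String(chars));
        if (pattern.matcher(input).find()) {
            throw new IllegalStateException("input was expected not to match");
        }
        return input.reads;
    }

    public static void main(String[] args) {
        // Doubling n should much more than double the number of reads.
        for (int n : new int[] {1_000, 2_000, 4_000}) {
            System.out.println("n=" + n + " reads=" + readsFor(n));
        }
    }
}
```

This only shows polynomial growth on a simple pattern. Nested quantifiers can be far worse, which is how a single row can occupy a worker thread for hours.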
Google's RE2 library was in part
[created to address this](https://github.com/google/re2/wiki/WhyRE2):
> RE2 was designed and implemented with an explicit goal of being able to
handle regular expressions from untrusted users without risk. One of its
primary guarantees is that the match time is linear in the length of the input
string. It was also written with production concerns in mind: the parser, the
compiler and the execution engines limit their memory usage by working within a
configurable budget – failing gracefully when exhausted – and they avoid stack
overflow by eschewing recursion.
RE2 is also used by other databases, such as ClickHouse.
If `re2j` seems to be the right approach, we could:
1. Make a clean switch to [re2j](https://github.com/google/re2j), which
carries some behavior/performance differences
2. Add a config to allow users to choose their desired implementation
3. Take an additional argument in `regexp_like` (I feel we should not leak
the implementation like this)
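
As a rough illustration of option 2, the sketch below picks the regex engine from a config value and hides it behind a small interface, so callers such as `regexp_like` never see which implementation is used. The names here (`RegexEngine`, `CompiledRegex`, `fromConfig`, and the config values) are hypothetical and are not existing Pinot APIs. The re2j branch is left as a stub so the example runs without the extra dependency:

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class RegexEngineConfigSketch {
    /** Hypothetical engine choices; the config values are assumptions, not a Pinot setting. */
    enum RegexEngine {
        JAVA_UTIL, RE2J;

        /** Parses a config string such as "java_util" or "re2j"; defaults to JAVA_UTIL. */
        static RegexEngine fromConfig(String value) {
            return value == null ? JAVA_UTIL : valueOf(value.trim().toUpperCase(Locale.ROOT));
        }
    }

    /** Engine-neutral view of a compiled pattern. */
    interface CompiledRegex {
        boolean find(String value);
    }

    static CompiledRegex compile(String regex, RegexEngine engine) {
        if (engine == RegexEngine.JAVA_UTIL) {
            Pattern pattern = Pattern.compile(regex);
            return value -> pattern.matcher(value).find();
        }
        // With re2j on the classpath, this branch would wrap
        // com.google.re2j.Pattern.compile(regex) and its matcher(value).find().
        throw new UnsupportedOperationException("re2j is not wired into this sketch");
    }

    public static void main(String[] args) {
        CompiledRegex regex = compile("a+b", RegexEngine.fromConfig("java_util"));
        System.out.println(regex.find("xxaab"));
    }
}
```

Keeping the choice in config rather than in the function signature is what separates this from option 3: queries stay unchanged if the default engine is switched later.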