twilliamson commented on PR #16153: URL: https://github.com/apache/druid/pull/16153#issuecomment-2040716015
This is from five or six years ago, when I was working on stream processing systems at Facebook, but my recollection is that `re2j` had issues with UTF-8 multi-byte sequences. I'm not sure whether that's still the case, but I remember it not working as a drop-in replacement. We looked at what the Trino folks were doing at the time, which led us to Joni, and we were able to switch to it without any of our pipeline owners noticing. From what I can remember, Joni doesn't make hard runtime guarantees, but in practice we didn't see it run into the same pathological behavior. We would still sometimes see exceptions for certain inputs (maybe `StackOverflowError`? but you also get those with `java.util.regex`…).

I just checked, and it looks like Trino has since moved to [a custom `LIKE` implementation based on DFAs](https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/likematcher/DFA.java). (It looks quite complicated; I'm tempted to submit an MR to Trino with the same approach as in this MR…) Trino appears to still be [using Joni](https://github.com/trinodb/trino/blob/4d6df76558c999f01560f68f96698c441299e000/core/trino-main/src/main/java/io/trino/operator/scalar/JoniRegexpFunctions.java#L48) as the default for `regexp_*` functions, with [an option to use `re2j`](https://github.com/trinodb/trino/blob/4d6df76558c999f01560f68f96698c441299e000/core/trino-main/src/main/java/io/trino/sql/analyzer/RegexLibrary.java).
