Shekharrajak commented on PR #2772: URL: https://github.com/apache/datafusion-comet/pull/2772#issuecomment-3875299422
> > This looks great overall @Shekharrajak, but could you mark it as incompatible as described in [#2772 (comment)](https://github.com/apache/datafusion-comet/pull/2772#issuecomment-3529547218). I don't think it is realistic to be able to match Spark behavior fully, at least not as part of this PR.
> > In future PRs, we can probably implement fallback rules for specific cases, such as Java-specific regex patterns.
> > I was able to find cases where Comet crashes at runtime when using lookahead, lookbehind, and back references, for example.

I explored this and am writing down the findings for future reference:

Java's regex engine supports these advanced features:

• Lookahead: (?=...), (?!...)
• Lookbehind: (?<=...), (?<!...)
• Back references: \1, \2

Rust's `regex` crate intentionally does not support any of these. By design it guarantees worst-case matching time that is linear in the length of the input, which rules out a backtracking engine, and look-around and back references are not known to be implementable efficiently under that guarantee.

So if a user writes something like:

    SELECT split(email, '(?<=@)') FROM users

* Spark: works fine, because Java regex handles the lookbehind.
* Comet: the pattern gets passed to Rust's regex engine, which fails at runtime with a pattern parse error because (?<=@) is unsupported syntax.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
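
To make the lookbehind case above concrete, here is a minimal Java sketch. It first shows what Java's own regex engine does with `(?<=@)`, then shows a rough check that a future fallback rule might use. The class `RegexFeatureCheck` and the method `usesJavaOnlyFeature` are hypothetical names invented for illustration; they are not part of Comet or Spark. The check is a conservative heuristic: it ignores character classes and `\Q...\E` quoting, so it can flag a pattern that would actually be fine.

```java
import java.util.Arrays;

class RegexFeatureCheck {

    // Hypothetical helper (not a Comet API): returns true when the pattern
    // appears to use look-around or a back reference.
    static boolean usesJavaOnlyFeature(String pattern) {
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '\\' && i + 1 < pattern.length()) {
                char next = pattern.charAt(i + 1);
                // \1..\9 are numbered back references, \k<name> is a named one.
                if ((next >= '1' && next <= '9') || next == 'k') {
                    return true;
                }
                i++; // skip the escaped character, e.g. the "(" in "\("
            } else if (c == '(') {
                // Lookahead (?= (?! and lookbehind (?<= (?<!
                if (pattern.startsWith("(?=", i) || pattern.startsWith("(?!", i)
                        || pattern.startsWith("(?<=", i) || pattern.startsWith("(?<!", i)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Java splits at the zero-width position right after '@'.
        String[] parts = "user@example.com".split("(?<=@)");
        System.out.println(Arrays.toString(parts)); // [user@, example.com]

        System.out.println(usesJavaOnlyFeature("(?<=@)"));  // true
        System.out.println(usesJavaOnlyFeature("(a)\\1"));  // true
        System.out.println(usesJavaOnlyFeature("\\s*,\\s*")); // false
    }
}
```

A check along these lines would run at planning time, so a query with such a pattern could be handed back to Spark instead of failing inside the native engine. Erring toward false positives seems like the safer trade-off here, since a false positive only costs a fallback.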
