Dandandan opened a new pull request, #21379: URL: https://github.com/apache/datafusion/pull/21379
## Summary

- For anchored patterns like `^prefix(capture)/.*$`, uses `regex-syntax` HIR analysis to build a shorter regex without the trailing `.*`
- Uses `captures()` + `expand()` on the shorter regex instead of `replacen()`, since the original `^...$` pattern replaces the entire string and we only need correct capture group positions
- For ClickBench Q28's `^https?://(?:www\.)?([^/]+)/.*$`, the effective regex becomes `^https?://(?:www\.)?([^/]+)/`, so the backtracker stops at the first `/` after the domain instead of scanning the full URL

## Test plan

- [x] Existing `regexp_replace` unit tests pass
- [x] ClickBench regexp sqllogictests pass
- [x] New sqllogictest for URL domain extraction with path suffixes (catches the correctness bug where `replacen` on a shorter regex would leave the suffix)
- [ ] Benchmark with ClickBench Q28

🤖 Generated with [Claude Code](https://claude.com/claude-code)
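To illustrate why the shortened regex needs `captures()` + `expand()` rather than `replacen()`, here is a minimal sketch using Python's standard `re` module as a stand-in for the Rust `regex` crate. It is not the PR's code: the helper names are hypothetical, the shortened pattern is written by hand instead of being derived through HIR analysis, and Python's `\1` template syntax stands in for the crate's `$1`.

```python
import re

# Original anchored pattern from ClickBench Q28 and its hand-shortened form.
# (The PR derives the short form via regex-syntax HIR analysis; here it is
# simply written out, which is an assumption made for illustration.)
FULL = re.compile(r"^https?://(?:www\.)?([^/]+)/.*$")
SHORT = re.compile(r"^https?://(?:www\.)?([^/]+)/")


def replace_full(url: str, template: str = r"\1") -> str:
    """Baseline: substitute with the full pattern, which spans the whole string."""
    return FULL.sub(template, url, count=1)


def replace_short_naive(url: str, template: str = r"\1") -> str:
    """Buggy variant: substituting with the short pattern keeps the unmatched suffix."""
    return SHORT.sub(template, url, count=1)


def replace_short_expand(url: str, template: str = r"\1") -> str:
    """Match with the short pattern, then return only the expanded template.

    Because the original pattern would have consumed the entire string,
    the expanded template alone is the complete result.
    """
    m = SHORT.match(url)
    return m.expand(template) if m else url


if __name__ == "__main__":
    url = "https://www.example.com/path/to/page?x=1"
    print(replace_full(url))          # example.com
    print(replace_short_naive(url))   # example.compath/to/page?x=1
    print(replace_short_expand(url))  # example.com
```

The naive variant reproduces the correctness bug mentioned in the test plan: the path suffix survives the replacement. The expand-based variant agrees with the full pattern on these single-line inputs while only matching up to the first `/` after the domain.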
