Dandandan opened a new pull request, #21379:
URL: https://github.com/apache/datafusion/pull/21379

   ## Summary
   - For anchored patterns like `^prefix(capture)/.*$`, uses `regex-syntax` HIR 
analysis to build a shorter regex without the trailing `.*$`
   - Uses `captures()` + `expand()` on the shorter regex instead of 
`replacen()`: the original `^...$` pattern matches the entire string, so the 
result is just the expanded replacement, and the shorter regex only has to 
produce the correct capture group positions
   - For ClickBench Q28's `^https?://(?:www\.)?([^/]+)/.*$`, the effective 
regex becomes `^https?://(?:www\.)?([^/]+)/`, so the backtracker stops at the 
first `/` after the domain instead of scanning the full URL
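
To illustrate why the shorter regex needs capture-and-expand rather than a plain replace, here is a minimal sketch using Python's `re` module as a stand-in for the Rust `regex` crate. It is not the PR's code: the function names are hypothetical, the shortened pattern is written by hand rather than derived from HIR analysis, and it assumes single-line inputs (where the trailing `.*$` always matches).

```python
import re

# Q28 pattern from the summary, and the shortened form with `.*$` removed.
# The shortening is done by hand here; the PR derives it via regex-syntax HIR.
FULL = r"^https?://(?:www\.)?([^/]+)/.*$"
SHORT = r"^https?://(?:www\.)?([^/]+)/"


def replace_full(url: str, template: str = r"\1") -> str:
    """Baseline: replace the first match of the full anchored pattern."""
    return re.sub(FULL, template, url, count=1)


def replace_short_naive(url: str, template: str = r"\1") -> str:
    """Incorrect shortcut: a replace with the shorter regex only rewrites
    the matched prefix, so the rest of the URL is left in the output."""
    return re.sub(SHORT, template, url, count=1)


def replace_short_expand(url: str, template: str = r"\1") -> str:
    """Match the shorter regex, then expand the template from its captures.

    The full pattern would have consumed the whole string, so the expanded
    template is the entire result. A non-match returns the input unchanged,
    which is also what a replace does when nothing matches.
    """
    m = re.match(SHORT, url)
    return m.expand(template) if m else url
```

For a URL such as `https://www.example.com/path/to/page?x=1`, the naive replace keeps the path suffix after the domain, while the expand variant agrees with the full pattern.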
   
   ## Test plan
   - [x] Existing `regexp_replace` unit tests pass
   - [x] ClickBench regexp sqllogictests pass
   - [x] New sqllogictest for URL domain extraction with path suffixes (catches 
the correctness bug where `replacen` on a shorter regex would leave the suffix)
   - [ ] Benchmark with ClickBench Q28
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

