olsemeno opened a new pull request, #20985:
URL: https://github.com/apache/datafusion/pull/20985

   ## Which issue does this PR close?
   
   - Closes #.
   
   ## Rationale for this change
   
   Apache DataFusion is missing a `regexp_extract` scalar function that is 
available in Apache Spark (via `functions.regexp_extract`). This function is 
commonly used in ETL pipelines to extract specific capture groups from strings 
using regular expressions. Adding it improves compatibility with Spark 
workloads migrating to DataFusion.
   
   ## What changes are included in this PR?
   
   - Added a new scalar UDF `regexp_extract(str, regexp, idx)` in 
`datafusion/functions/src/regex/regexpextract.rs`
     - Accepts `Utf8`, `LargeUtf8`, and `Utf8View` string types
     - Third argument `idx` (`Int64`) selects the capture group: `0` returns 
the full match, and any `idx ≥ 1` returns the corresponding capture group
     - Returns an empty string `""` when the pattern does not match or the 
requested group is absent (Spark-compatible behavior)
     - Returns `NULL` when the input string is `NULL`
     - Returns an error for negative `idx` values
     - Supports per-row `idx` values (column argument), not just scalar literals
     - Uses the existing `compile_and_cache_regex` helper for regex compilation 
and caching
   - Registered the function in `datafusion/functions/src/regex/mod.rs`
   - Added `regexp_extract` usage examples to 
`datafusion-examples/examples/builtin_functions/regexp.rs`
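
The semantics listed above can be modeled in a few lines. The following is only a rough Python sketch using the standard `re` module as a stand-in for the Rust `regex` crate; it is not the code in `regexpextract.rs`. The function body, the `ValueError`, and the order in which the negative-`idx` and `NULL` checks run are assumptions made for illustration.

```python
import re
from typing import Optional


def regexp_extract(s: Optional[str], pattern: str, idx: int) -> Optional[str]:
    """Hypothetical model of the described behavior (not the PR's Rust code)."""
    # Assumption: a negative idx is rejected before the NULL check.
    if idx < 0:
        raise ValueError("regexp_extract: idx must be non-negative")
    # NULL input string -> NULL output.
    if s is None:
        return None
    m = re.search(pattern, s)
    # No match -> empty string (the Spark-compatible behavior described above).
    if m is None:
        return ""
    # Requested group does not exist in the pattern -> empty string.
    if idx > m.re.groups:
        return ""
    # A group that exists but did not participate in the match yields None
    # in Python; map it to an empty string as well.
    return m.group(idx) or ""


date_pattern = r"(\d{4})-(\d{2})-(\d{2})"
print(regexp_extract("2024-03-16", date_pattern, 1))  # 2024
print(regexp_extract("2024-03-16", date_pattern, 0))  # 2024-03-16
print(repr(regexp_extract("no date", date_pattern, 1)))  # ''
```

Python's `re` and Rust's `regex` differ in some syntax details, so this sketch is only meant to illustrate the return-value rules, not pattern compatibility.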
   
   ## Are these changes tested?
   
   Yes. Ten unit tests are included in `regexpextract.rs` covering:
   
   | Test | Scenario |
   |---|---|
   | `test_basic_group` | Extract a specific capture group |
   | `test_idx_zero_returns_whole_match` | `idx=0` returns the full match |
   | `test_no_match_returns_empty_string` | No match → `""` |
   | `test_null_input_returns_null` | NULL input → NULL output |
   | `test_empty_pattern_returns_empty_string` | Empty pattern → `""` |
   | `test_idx_out_of_range_returns_empty_string` | Group index beyond available groups → `""` |
   | `test_negative_idx_returns_error` | Negative `idx` → error |
   | `test_multiple_groups` | Multiple capture groups, select second |
   | `test_idx_as_column` | `idx` as a per-row column (not scalar) |
   | `test_batch_mixed_nulls` | Batch with mixed NULL and non-NULL rows |
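
The last two scenarios in the table (a per-row `idx` column and a batch with mixed NULL rows) might look roughly like the following. This is a hedged Python sketch, not the Rust unit tests themselves: `regexp_extract_batch` is a hypothetical helper, Python lists stand in for Arrow arrays, and compiling the pattern once per batch only loosely mirrors what `compile_and_cache_regex` is described as doing. NULL values in the `idx` column are not covered by the description above, so the sketch does not model them.

```python
import re
from typing import List, Optional


def regexp_extract_batch(
    values: List[Optional[str]], pattern: str, idxs: List[int]
) -> List[Optional[str]]:
    """Hypothetical row-wise model: one input string and one idx per row."""
    compiled = re.compile(pattern)  # compile once, reuse for every row
    out: List[Optional[str]] = []
    for s, idx in zip(values, idxs):
        if idx < 0:
            raise ValueError("regexp_extract: idx must be non-negative")
        if s is None:
            out.append(None)  # NULL row stays NULL
            continue
        m = compiled.search(s)
        if m is None or idx > compiled.groups:
            out.append("")  # no match, or group index out of range
        else:
            out.append(m.group(idx) or "")
    return out


rows = ["a1-b2", None, "zzz", "c3-d4"]
group_per_row = [1, 2, 1, 4]
result = regexp_extract_batch(rows, r"([a-z])(\d)-([a-z])(\d)", group_per_row)
print(result)  # ['a', None, '', '4']
```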
   
   ## Are there any user-facing changes?
   
   Yes. A new scalar function `regexp_extract(str, regexp, idx)` is now 
available in both SQL and the DataFrame API.
   
   ```sql
   SELECT regexp_extract('2024-03-16', '(\d{4})-(\d{2})-(\d{2})', 1);
   -- returns: '2024'
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

