olsemeno opened a new pull request, #20985:
URL: https://github.com/apache/datafusion/pull/20985
## Which issue does this PR close?
- Closes #.
## Rationale for this change
Apache DataFusion is missing a `regexp_extract` scalar function that is
available in Apache Spark (via `functions.regexp_extract`). This function is
commonly used in ETL pipelines to extract specific capture groups from strings
using regular expressions. Adding it improves compatibility with Spark
workloads migrating to DataFusion.
## What changes are included in this PR?
- Added a new scalar UDF `regexp_extract(str, regexp, idx)` in
  `datafusion/functions/src/regex/regexpextract.rs`
  - Accepts `Utf8`, `LargeUtf8`, and `Utf8View` string types
  - Third argument `idx` (`Int64`) selects the capture group: `0` returns
    the full match, `≥1` returns the corresponding group
  - Returns an empty string `""` when the pattern does not match or the
    requested group is absent (Spark-compatible behavior)
  - Returns `NULL` when the input string is `NULL`
  - Returns an error for negative `idx` values
  - Supports per-row `idx` values (column argument), not just scalar literals
  - Uses the existing `compile_and_cache_regex` helper for regex compilation
    and caching
- Registered the function in `datafusion/functions/src/regex/mod.rs`
- Added `regexp_extract` usage examples to
`datafusion-examples/examples/builtin_functions/regexp.rs`
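
For reviewers who want to reason about the edge cases, the scalar semantics listed above can be modelled in a few lines of Python. This is only a behavioural sketch, not the Rust implementation in this PR: it uses Python's `re` module (whose syntax differs in places from the Rust `regex` crate), the order of the negative-`idx` and `NULL` checks is an assumption, and `NULL` handling for the pattern and `idx` arguments is not modelled because the description does not state it.

```python
import re
from typing import Optional


def regexp_extract(s: Optional[str], pattern: str, idx: int) -> Optional[str]:
    """Toy model of the described semantics; not the DataFusion code."""
    # Negative group indexes are rejected (check order is an assumption).
    if idx < 0:
        raise ValueError(f"idx must be non-negative, got {idx}")
    # NULL input string -> NULL output.
    if s is None:
        return None
    # First match anywhere in the string, as in Spark's regexp_extract.
    m = re.search(pattern, s)
    if m is None:
        return ""  # no match -> empty string
    if idx > m.re.groups:
        return ""  # group index beyond the available groups -> empty string
    # A group that did not participate in the match is None in Python.
    return m.group(idx) or ""


print(regexp_extract("2024-03-16", r"(\d{4})-(\d{2})-(\d{2})", 1))
```

Under this model, `idx = 0` returns the whole match and an empty pattern yields `""`, because it matches the empty string at position 0 and has no capture groups.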
## Are these changes tested?
Yes. Ten unit tests are included in `regexpextract.rs` covering:
| Test | Scenario |
|---|---|
| `test_basic_group` | Extract a specific capture group |
| `test_idx_zero_returns_whole_match` | `idx=0` returns the full match |
| `test_no_match_returns_empty_string` | No match → `""` |
| `test_null_input_returns_null` | NULL input → NULL output |
| `test_empty_pattern_returns_empty_string` | Empty pattern → `""` |
| `test_idx_out_of_range_returns_empty_string` | Group index beyond available groups → `""` |
| `test_negative_idx_returns_error` | Negative `idx` → error |
| `test_multiple_groups` | Multiple capture groups, select second |
| `test_idx_as_column` | `idx` as a per-row column (not scalar) |
| `test_batch_mixed_nulls` | Batch with mixed NULL and non-NULL rows |
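
The column-oriented scenarios (`test_idx_as_column`, `test_batch_mixed_nulls`) can be pictured with a small row-wise model. The sketch below is illustrative Python, not the tested Rust code: `regexp_extract_rows` is a hypothetical name, and the dictionary cache only stands in for the role that `compile_and_cache_regex` plays, since that helper's actual signature is not shown in this description.

```python
import re
from typing import Dict, List, Optional


def regexp_extract_rows(
    values: List[Optional[str]], patterns: List[str], idxs: List[int]
) -> List[Optional[str]]:
    """Row-wise toy model with a per-call cache of compiled patterns."""
    cache: Dict[str, re.Pattern] = {}  # pattern text -> compiled regex
    out: List[Optional[str]] = []
    for s, pat, idx in zip(values, patterns, idxs):
        if idx < 0:
            raise ValueError(f"idx must be non-negative, got {idx}")
        if s is None:
            out.append(None)  # NULL row stays NULL
            continue
        rx = cache.get(pat)
        if rx is None:
            # Compile each distinct pattern only once per batch.
            rx = cache[pat] = re.compile(pat)
        m = rx.search(s)
        if m is None or idx > rx.groups:
            out.append("")  # no match or group index out of range
        else:
            out.append(m.group(idx) or "")
    return out


rows = regexp_extract_rows(
    ["a-1", None, "b-2", "zzz"], [r"(\w)-(\d)"] * 4, [1, 1, 2, 0]
)
print(rows)  # ['a', None, '2', '']
```

Each row uses its own `idx`, and the pattern is compiled once even though it appears in all four rows.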
## Are there any user-facing changes?
Yes. A new scalar function `regexp_extract(str, regexp, idx)` is now
available in both SQL and the DataFrame API.
```sql
SELECT regexp_extract('2024-03-16', '(\d{4})-(\d{2})-(\d{2})', 1);
-- returns: '2024'
```