Re: [PR] feat: add native implementations of `regexp_extract` and `regexp_extract_all` [datafusion-comet]

via GitHub Thu, 28 May 2026 13:17:27 -0700


mbutrovich commented on PR #4146:
URL: 
https://github.com/apache/datafusion-comet/pull/4146#issuecomment-4567844694


   Thanks for tackling these @andygrove, the support-level handling and the SQL 
test shape (including the explicit `ignore` for lookahead) look great.
   
   A few things I wanted to ask about:
   
   ### Could we lift this into a `CometRegExpBase`?
   
   `CometRLike`, `CometRegExpReplace`, and the two new ones all carry the same 
Rust-vs-Java caveat, and all four take a literal pattern as their first regex 
arg. `RLike` and `RegExpReplace` already do two things the new pair doesn't:
   
   1. They use the shared `regexp` allow-incompat key 
(`CometConf.isExprAllowIncompat("regexp")`), so the user opts in once for the 
whole family. The new PR introduces per-class `RegExpExtract.allowIncompatible` 
/ `RegExpExtractAll.allowIncompatible` flags, which diverges.
   2. They call `RegExp.isSupportedPattern(pattern)` to upgrade safe patterns 
to `Compatible` rather than always being `Incompatible`.
   
   Worth folding all four into a common base?
   
   ```scala
   abstract class CometRegExpBase[T <: Expression] extends 
CometExpressionSerde[T] {
     protected def patternOf(expr: T): Expression
     protected def isPatternSafe(expr: T): Boolean =
       patternOf(expr) match {
         case Literal(p, DataTypes.StringType) => 
RegExp.isSupportedPattern(p.toString)
         case _ => false
       }
   
     override def getIncompatibleReasons(): Seq[String] = Seq(
       "Uses Rust regexp engine, which has different behavior to Java regexp 
engine")
   
     protected def regexSupportLevel(expr: T): SupportLevel =
       patternOf(expr) match {
         case _: Literal if isPatternSafe(expr) => Compatible()
         case _: Literal => Incompatible()
         case _ => Unsupported(Some("Only scalar regexp patterns are 
supported"))
       }
   }
   ```
   
   `CometRegExpExtract` / `CometRegExpExtractAll` would then add their 
`idx`-must-be-literal check on top of `regexSupportLevel`, and the existing 
`CometRLike` / `CometRegExpReplace` lose their bespoke copies. That also gets 
the new pair onto the shared `regexp` allow-incompat key, which I think is what 
users would expect.
   
   ### Same idea on the Rust side?
   
   `regexp_replace` is delegated upstream and `RLike` is its own 
`PhysicalExpr`, so there's no big cross-family dedup available, but within this 
PR `regexp_extract` and `regexp_extract_all` still share their arg parsing, 
regex compile, group-count validation, `subject_len`, `null_result`, and 
subject dispatch verbatim. A small shared helper that returns the parsed 
`(regex, group_idx, subject)` tuple (or an early-exit null result) would leave 
each UDF as just its own per-row loop:
   
   ```rust
   pub(super) struct ParsedArgs<'a> {
       pub regex: Regex,
       pub group_idx: usize,
       pub subject: &'a ColumnarValue,
   }
   
   pub(super) enum ParseResult<'a> {
       Parsed(ParsedArgs<'a>),
       NullResult(ColumnarValue), // null pattern or null idx short-circuit
   }
   
   pub(super) fn parse_regexp_extract_args<'a>(
       fn_name: &'static str,
       args: &'a [ColumnarValue],
   ) -> DataFusionResult<ParseResult<'a>> { ... }
   ```
   
   ### `LargeUtf8` result type in `regexp_extract`
   
   `extract_array::<O>` is generic over the offset width, so a `LargeUtf8` 
input array yields a `LargeStringArray`, but the serde declares the return type 
as `expr.dataType` (`Utf8`). Is that intentional, or worth always returning 
`Utf8` (the scalar branch already does)? Something like:
   
   ```rust
   DataType::Utf8 | DataType::LargeUtf8 => {
       let strings: &dyn Array = array.as_ref();
       Ok(ColumnarValue::Array(extract_array(strings, &regex, group_idx)))
   }
   ```
   
   with `extract_array` always producing a `StringArray` (i32 offsets), since 
`&str` slices are width-agnostic. `regexp_extract_all` looks consistent here 
since its value builder is always `StringBuilder`.
   
   Possibly related: I didn't see a Rust unit test for a `LargeUtf8` subject in 
either file, would be nice to lock that branch down.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] feat: add native implementations of `regexp_extract` and `regexp_extract_all` [datafusion-comet]

Reply via email to