mbutrovich commented on PR #4146:
URL:
https://github.com/apache/datafusion-comet/pull/4146#issuecomment-4567844694
Thanks for tackling these @andygrove, the support-level handling and the SQL
test shape (including the explicit `ignore` for lookahead) look great.
A few things I wanted to ask about:
### Could we lift this into a `CometRegExpBase`?
`CometRLike`, `CometRegExpReplace`, and the two new ones all carry the same
Rust-vs-Java caveat, and all four take a literal pattern as their first regex
arg. `RLike` and `RegExpReplace` already do two things the new pair doesn't:
1. They use the shared `regexp` allow-incompat key
(`CometConf.isExprAllowIncompat("regexp")`), so the user opts in once for the
whole family. The new PR introduces per-class `RegExpExtract.allowIncompatible`
/ `RegExpExtractAll.allowIncompatible` flags, which diverges.
2. They call `RegExp.isSupportedPattern(pattern)` to upgrade safe patterns
to `Compatible` rather than always being `Incompatible`.
Worth folding all four into a common base?
```scala
abstract class CometRegExpBase[T <: Expression] extends
CometExpressionSerde[T] {
protected def patternOf(expr: T): Expression
protected def isPatternSafe(expr: T): Boolean =
patternOf(expr) match {
case Literal(p, DataTypes.StringType) =>
RegExp.isSupportedPattern(p.toString)
case _ => false
}
override def getIncompatibleReasons(): Seq[String] = Seq(
"Uses Rust regexp engine, which has different behavior to Java regexp
engine")
protected def regexSupportLevel(expr: T): SupportLevel =
patternOf(expr) match {
case _: Literal if isPatternSafe(expr) => Compatible()
case _: Literal => Incompatible()
case _ => Unsupported(Some("Only scalar regexp patterns are
supported"))
}
}
```
`CometRegExpExtract` / `CometRegExpExtractAll` would then add their
`idx`-must-be-literal check on top of `regexSupportLevel`, and the existing
`CometRLike` / `CometRegExpReplace` lose their bespoke copies. That also gets
the new pair onto the shared `regexp` allow-incompat key, which I think is what
users would expect.
### Same idea on the Rust side?
`regexp_replace` is delegated upstream and `RLike` is its own
`PhysicalExpr`, so there's no big cross-family dedup available, but within this
PR `regexp_extract` and `regexp_extract_all` still share their arg parsing,
regex compile, group-count validation, `subject_len`, `null_result`, and
subject dispatch verbatim. A small shared helper that returns the parsed
`(regex, group_idx, subject)` tuple (or an early-exit null result) would leave
each UDF as just its own per-row loop:
```rust
pub(super) struct ParsedArgs<'a> {
pub regex: Regex,
pub group_idx: usize,
pub subject: &'a ColumnarValue,
}
pub(super) enum ParseResult<'a> {
Parsed(ParsedArgs<'a>),
NullResult(ColumnarValue), // null pattern or null idx short-circuit
}
pub(super) fn parse_regexp_extract_args<'a>(
fn_name: &'static str,
args: &'a [ColumnarValue],
) -> DataFusionResult<ParseResult<'a>> { ... }
```
### `LargeUtf8` result type in `regexp_extract`
`extract_array::<O>` is generic over the offset width, so a `LargeUtf8`
input array yields a `LargeStringArray`, but the serde declares the return type
as `expr.dataType` (`Utf8`). Is that intentional, or worth always returning
`Utf8` (the scalar branch already does)? Something like:
```rust
DataType::Utf8 | DataType::LargeUtf8 => {
let strings: &dyn Array = array.as_ref();
Ok(ColumnarValue::Array(extract_array(strings, ®ex, group_idx)))
}
```
with `extract_array` always producing a `StringArray` (i32 offsets), since
`&str` slices are width-agnostic. `regexp_extract_all` looks consistent here
since its value builder is always `StringBuilder`.
Possibly related: I didn't see a Rust unit test for a `LargeUtf8` subject in
either file, would be nice to lock that branch down.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]