andygrove commented on PR #4146:
URL:
https://github.com/apache/datafusion-comet/pull/4146#issuecomment-4672660850
Thanks for the careful review @mbutrovich. Pushed `367e07cd3` which
addresses #2 and #3:
**#3 — `LargeUtf8` result type.** Confirmed your hunch. Both
`regexp_extract::extract_array` and `regexp_extract_all::extract_all_array` are
now hardwired to a `StringBuilder` (i32 offsets) regardless of input width, so
a `LargeUtf8` subject still yields a `StringArray` matching
`RegExpExtract.dataType` = `StringType`. Doc comment on each function explains
the constraint. Added two regression tests
(`large_utf8_subject_returns_utf8_array` in `regexp_extract.rs`,
`large_utf8_subject_returns_inner_utf8` in `regexp_extract_all.rs`) that
downcast the result to `StringArray` and would fail loudly if either UDF ever
drifts back.
**#2 — Rust dedup.** New `regexp_extract_common.rs` with a
`parse_args(fn_name, args)` helper that returns `ParsedArgs::Parsed { regex,
group_idx, subject }` or `ParsedArgs::NullResult { len }`. Each UDF now starts
with a one-match expression and is left with just its own per-row loop and
per-function null shape (scalar `Utf8(None)` for extract, scalar/array
`ListArray` of nulls for extract_all). Net diff is +1 file, ~120 lines removed
across the two UDFs.
**#1 — `CometRegExpBase` on the Scala side.** The two motivations you cited
(shared `regexp` allow-incompat key, `RegExp.isSupportedPattern` upgrading safe
patterns to `Compatible`) were both removed by #4239 — that PR made all four of
`CometRLike`, `CometRegExpReplace`, `CometRegExpExtract`,
`CometRegExpExtractAll` return `Compatible()` and rely on per-class
`getExprConfigName(expr)` for the allow-incompat key, with the codegen
dispatcher as the byte-exact fallback. So your proposed `regexSupportLevel`
(which would return `Incompatible` for unsafe patterns) would be a regression —
those patterns now route through the dispatcher and are byte-exact.
The structural dedup itself is still real — all four serdes have a "`if
isExprAllowIncompat AND nativeSupported THEN serialize ELSE
emitJvmCodegenDispatch`" body. I'd rather scope that to a follow-up that can
touch all four together (the `RLike` and `RegExpReplace` rewrites are out of
this PR's lane), and tracked separately since the surface area — what
`nativeSupported` means per-expression, what gets serialized — varies. Happy to
file that issue if you agree, or leave a TODO in the code.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]