andygrove commented on PR #4146:
URL: 
https://github.com/apache/datafusion-comet/pull/4146#issuecomment-4672660850

   Thanks for the careful review @mbutrovich. Pushed `367e07cd3` which 
addresses #2 and #3:
   
   **#3 — `LargeUtf8` result type.** Confirmed your hunch. Both 
`regexp_extract::extract_array` and `regexp_extract_all::extract_all_array` are 
now hardwired to a `StringBuilder` (i32 offsets) regardless of input width, so 
a `LargeUtf8` subject still yields a `StringArray` matching 
`RegExpExtract.dataType` = `StringType`. Doc comment on each function explains 
the constraint. Added two regression tests 
(`large_utf8_subject_returns_utf8_array` in `regexp_extract.rs`, 
`large_utf8_subject_returns_inner_utf8` in `regexp_extract_all.rs`) that 
downcast the result to `StringArray` and would fail loudly if either UDF ever 
drifts back.
   
   **#2 — Rust dedup.** New `regexp_extract_common.rs` with a 
`parse_args(fn_name, args)` helper that returns `ParsedArgs::Parsed { regex, 
group_idx, subject }` or `ParsedArgs::NullResult { len }`. Each UDF now starts 
with a one-match expression and is left with just its own per-row loop and 
per-function null shape (scalar `Utf8(None)` for extract, scalar/array 
`ListArray` of nulls for extract_all). Net diff is +1 file, ~120 lines removed 
across the two UDFs.
   
   **#1 — `CometRegExpBase` on the Scala side.** The two motivations you cited 
(shared `regexp` allow-incompat key, `RegExp.isSupportedPattern` upgrading safe 
patterns to `Compatible`) were both removed by #4239 — that PR made all four of 
`CometRLike`, `CometRegExpReplace`, `CometRegExpExtract`, 
`CometRegExpExtractAll` return `Compatible()` and rely on per-class 
`getExprConfigName(expr)` for the allow-incompat key, with the codegen 
dispatcher as the byte-exact fallback. So your proposed `regexSupportLevel` 
(which would return `Incompatible` for unsafe patterns) would be a regression — 
those patterns now route through the dispatcher and are byte-exact.
   
   The structural dedup itself is still real — all four serdes have a "`if 
isExprAllowIncompat AND nativeSupported THEN serialize ELSE 
emitJvmCodegenDispatch`" body. I'd rather scope that to a follow-up that can 
touch all four together (the `RLike` and `RegExpReplace` rewrites are out of 
this PR's lane), and tracked separately since the surface area — what 
`nativeSupported` means per-expression, what gets serialized — varies. Happy to 
file that issue if you agree, or leave a TODO in the code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to