LuciferYang opened a new pull request, #56318: URL: https://github.com/apache/spark/pull/56318
### What changes were proposed in this pull request?

This is a sub-task of [SPARK-56908](https://issues.apache.org/jira/browse/SPARK-56908) (reduce the size of generated Java code in whole-stage codegen).

`RegExpExtract` and `RegExpExtractAll` inline the entire match-result-extraction logic into the generated Java produced by `doGenCode` (a `find()` / `toMatchResult()` / `checkGroupIndex` / `group(idx)` block, plus a `while` loop and an `ArrayList` accumulation for the `*All` variant). This duplicates the logic that already exists in their interpreted `nullSafeEval` path and emits a large block into every generated class that uses these functions.

This PR extracts that logic into two shared helpers on the existing `object RegExpExtractBase` (placed next to `checkGroupIndex`, which the generated code already calls the same way):

- `RegExpExtractBase.extract(matcher, idx, prettyName): UTF8String`
- `RegExpExtractBase.extractAll(matcher, idx, prettyName): GenericArrayData`

Both `nullSafeEval` and `doGenCode` now call these helpers, so the generated Java is a single method call instead of an inline block. This mirrors the approach already used by `RegExpReplace` (`RegExpUtils.replace`, SPARK-57255 / #56315), reusing `RegExpExtractBase` here because `checkGroupIndex` is already co-located there.

`RegExpInStr` (a third `RegExpExtractBase` subclass) is intentionally left unchanged: it returns the match start position rather than an extracted group, so these helpers do not apply.

The unused `java.util.regex.MatchResult` import and the now-dead codegen locals (`matchResult`, `matchResults`, `arrayClass`) are removed.

### Why are the changes needed?

Smaller generated methods reduce JIT/Janino pressure and the risk of hitting the 64KB method limit in wide whole-stage-codegen stages.
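
To illustrate what moves out of the generated code, here is a minimal plain-Java sketch of the two helpers' semantics. It is not the code in this PR: the real helpers are Scala methods on `RegExpExtractBase` that return `UTF8String` / `GenericArrayData`. The class name `RegExpExtractSketch`, the `String` / `List<String>` return types, and the use of `IllegalArgumentException` in place of the `INVALID_PARAMETER_VALUE.REGEX_GROUP_INDEX` error are assumptions made so the example stays self-contained. Returning an empty string for a non-match or an unmatched optional group is also assumed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the shared helpers; names and types are illustrative only.
class RegExpExtractSketch {

    // Stand-in for checkGroupIndex: the real method raises a Spark error
    // (INVALID_PARAMETER_VALUE.REGEX_GROUP_INDEX), not IllegalArgumentException.
    static void checkGroupIndex(String prettyName, int groupCount, int idx) {
        if (idx < 0 || idx > groupCount) {
            throw new IllegalArgumentException(
                prettyName + ": group index " + idx + " is outside [0, " + groupCount + "]");
        }
    }

    // First match only. The group index is validated only after find() succeeds,
    // so a non-matching input never throws.
    static String extract(Matcher matcher, int idx, String prettyName) {
        if (!matcher.find()) {
            return ""; // assumed: no match yields an empty string
        }
        checkGroupIndex(prettyName, matcher.groupCount(), idx);
        String group = matcher.group(idx);
        return group == null ? "" : group; // optional group that did not participate
    }

    // Every match: the same per-match logic, accumulated into a list.
    static List<String> extractAll(Matcher matcher, int idx, String prettyName) {
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            checkGroupIndex(prettyName, matcher.groupCount(), idx);
            String group = matcher.group(idx);
            results.add(group == null ? "" : group);
        }
        return results;
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("([0-9]+)");
        System.out.println(extract(digits.matcher("id-123"), 1, "regexp_extract"));
        System.out.println(
            extractAll(Pattern.compile("([0-9])").matcher("a1b2"), 1, "regexp_extract_all"));
        // Out-of-range index on a non-matching input: no exception, empty result.
        System.out.println(extract(digits.matcher("none"), 5, "regexp_extract").isEmpty());
    }
}
```

With helpers shaped like this, the body emitted by `doGenCode` only needs to contain a single static call per expression, and the loop and null handling live once in compiled code rather than in every generated class.
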
Measured with `debugCodegen()` on a single-expression stage (`spark.range(1000).selectExpr(...)`):

| Plan | `maxMethodCodeSize` | `maxConstantPoolSize` |
|---|---|---|
| `regexp_extract(cast(id as string), '([0-9]+)', 1)` | 415 -> 357 (-14.0%) | 260 -> 239 (-8.1%) |
| `regexp_extract_all(cast(id as string), '([0-9])', 1)` | 569 -> 477 (-16.2%) | 320 -> 285 (-10.9%) |

### Does this PR introduce _any_ user-facing change?

No. This is a behavior-preserving refactor. The interpreted and codegen paths produce identical results, including the `INVALID_PARAMETER_VALUE.REGEX_GROUP_INDEX` error contract (the group-index check still runs only after a successful match, so a non-matching input never throws).

### How was this patch tested?

Existing `RegexpExpressionsSuite` tests for `RegExpExtract` and `RegExpExtractAll` pass (they exercise both the interpreted and codegen paths via `checkEvaluation`, including the `REGEX_GROUP_INDEX` error path), and scalastyle is clean. No new test is needed because the refactor preserves behavior and the existing tests already cover both paths.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code
