[PR] [SPARK-57258][SQL] Reduce regexp_extract/regexp_extract_all generated code size via shared extract helpers [spark]

via GitHub Thu, 04 Jun 2026 02:07:23 -0700


LuciferYang opened a new pull request, #56318:
URL: https://github.com/apache/spark/pull/56318


   ### What changes were proposed in this pull request?
   
   This is a sub-task of 
[SPARK-56908](https://issues.apache.org/jira/browse/SPARK-56908) (reduce the 
size of generated Java code in whole-stage codegen).
   
   `RegExpExtract` and `RegExpExtractAll` inline the entire 
match-result-extraction logic into the generated Java produced by `doGenCode` 
(a `find()` / `toMatchResult()` / `checkGroupIndex` / `group(idx)` block, plus 
a `while` loop and an `ArrayList` accumulation for the `*All` variant). This 
duplicates the logic already present in their `nullSafeEval` interpreted path 
and emits a large block into every generated class that uses these functions.
   
   This PR extracts that logic into two shared helpers on the existing `object 
RegExpExtractBase` (placed next to `checkGroupIndex`, which the generated code 
already calls the same way):
   
   - `RegExpExtractBase.extract(matcher, idx, prettyName): UTF8String`
   - `RegExpExtractBase.extractAll(matcher, idx, prettyName): GenericArrayData`
   
   Both `nullSafeEval` and `doGenCode` now call these helpers, so the generated 
Java is a single method call instead of an inline block. This mirrors the 
approach already used by `RegExpReplace` (`RegExpUtils.replace`, SPARK-57255 / 
#56315), reusing `RegExpExtractBase` here because `checkGroupIndex` is already 
co-located there.
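
A minimal Java sketch of what the two helpers might look like, written against plain `java.util.regex`. It is not the PR's code: the real helpers are Scala methods on `RegExpExtractBase` that return `UTF8String` and `GenericArrayData`, whereas this sketch returns `String` and `List<String>`. The class name `RegExpExtractSketch`, the `IllegalArgumentException` standing in for `INVALID_PARAMETER_VALUE.REGEX_GROUP_INDEX`, the exact bounds test, and the empty-string result for a non-matching input or an unmatched group are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;

// Hypothetical stand-in for the shared helpers; not Spark's actual code.
final class RegExpExtractSketch {
    private RegExpExtractSketch() {}

    // Stand-in for checkGroupIndex. Spark raises
    // INVALID_PARAMETER_VALUE.REGEX_GROUP_INDEX; a plain exception is used here.
    static void checkGroupIndex(String prettyName, int groupCount, int idx) {
        if (idx < 0 || idx > groupCount) {
            throw new IllegalArgumentException(
                prettyName + ": group index " + idx
                    + " is outside [0, " + groupCount + "]");
        }
    }

    // Single-match variant: returns the requested group of the first match.
    static String extract(Matcher matcher, int idx, String prettyName) {
        if (!matcher.find()) {
            // Assumption: no match yields an empty string, and the
            // group-index check is never reached.
            return "";
        }
        MatchResult mr = matcher.toMatchResult();
        checkGroupIndex(prettyName, mr.groupCount(), idx);
        String group = mr.group(idx);
        // A group that did not participate in the match is null in java.util.regex.
        return group == null ? "" : group;
    }

    // "All" variant: collects the requested group from every match.
    static List<String> extractAll(Matcher matcher, int idx, String prettyName) {
        List<String> results = new ArrayList<>();
        while (matcher.find()) {
            MatchResult mr = matcher.toMatchResult();
            checkGroupIndex(prettyName, mr.groupCount(), idx);
            String group = mr.group(idx);
            results.add(group == null ? "" : group);
        }
        return results;
    }
}
```

With helpers of this shape, each call site, interpreted or generated, reduces to a single call such as `RegExpExtractSketch.extract(matcher, 1, "regexp_extract")`. Because the index check sits after the `find()` test, a non-matching input returns before the check is reached.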
   
   `RegExpInStr` (a third `RegExpExtractBase` subclass) is intentionally left 
unchanged: it returns the match start position rather than an extracted group, 
so these helpers do not apply.
   
   The unused `java.util.regex.MatchResult` import and the now-dead codegen 
locals (`matchResult`, `matchResults`, `arrayClass`) are removed.
   
   ### Why are the changes needed?
   
   Smaller generated methods reduce Janino compilation and JIT pressure, and 
lower the risk of hitting the JVM's 64KB per-method bytecode limit in wide 
whole-stage-codegen stages. Measured with `debugCodegen()` on a 
single-expression stage (`spark.range(1000).selectExpr(...)`):
   
   | Plan | `maxMethodCodeSize` | `maxConstantPoolSize` |
   |---|---|---|
   | `regexp_extract(cast(id as string), '([0-9]+)', 1)` | 415 -> 357 (-14.0%) | 260 -> 239 (-8.1%) |
   | `regexp_extract_all(cast(id as string), '([0-9])', 1)` | 569 -> 477 (-16.2%) | 320 -> 285 (-10.9%) |
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is a behavior-preserving refactor. The interpreted and codegen 
paths produce identical results, including the 
`INVALID_PARAMETER_VALUE.REGEX_GROUP_INDEX` error contract (the group-index 
check still runs only after a successful match, so a non-matching input never 
throws).
   
   ### How was this patch tested?
   
   Existing `RegexpExpressionsSuite` tests for `RegExpExtract` and 
`RegExpExtractAll` pass (they exercise both interpreted and codegen via 
`checkEvaluation`, including the `REGEX_GROUP_INDEX` error path), and 
scalastyle is clean. No new test is needed because the refactor preserves 
behavior and the existing tests already cover both paths.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

