[PR] [SPARK-57255][SQL] Simplify RegExpReplace codegen by extracting the match/replace loop into a shared helper [spark]

via GitHub Wed, 03 Jun 2026 22:39:36 -0700


LuciferYang opened a new pull request, #56315:
URL: https://github.com/apache/spark/pull/56315


   ### What changes were proposed in this pull request?
   
   This is a sub-task of 
[SPARK-56908](https://issues.apache.org/jira/browse/SPARK-56908) (reduce 
generated Java size in whole-stage codegen).
   
   `RegExpReplace` inlined the same match/replace loop in both `nullSafeEval` 
and `doGenCode` — the `matcher.region(...)`, the `while (find) { 
appendReplacement }` loop with a `try/catch` that builds 
`invalidRegexpReplaceError(...)`, and `appendTail`. So every generated 
whole-stage class that uses `regexp_replace` carried that ~20-line block plus 
the error-construction constant-pool entries.
   
   This PR moves the loop into a shared helper:
   
   ```scala
   object RegExpUtils {
     def replace(matcher: Matcher, subject: UTF8String, replacement: String,
                 regexp: UTF8String, rep: UTF8String, pos: Int): UTF8String = { 
... }
   }
   ```
   
   Both `eval` and codegen now call it; `doGenCode` emits a single 
`RegExpUtils.replace(...)` call (via the Scala-object static forwarder, the 
same mechanism as the generated `QueryExecutionErrors.xxx` calls). This mirrors 
the existing `RegExpUtils.getPatternAndLastRegex` / `initLastMatcherCode` 
helpers in the same object. The eval-only cached `result` `StringBuilder` field 
is dropped — the helper allocates its own, and the codegen path always 
allocated one per call anyway.
   
   The Pattern/replacement caching (the per-instance mutable state) stays at 
the call sites; only the loop body moves.
   
   ### Why are the changes needed?
   
   To reduce the size of the generated Java in whole-stage codegen, as tracked 
by SPARK-56908. Moving the loop + error construction into one compiled method 
removes it from every generated class. Measured with `debugCodegen()` on 
`spark.range(1000).selectExpr("regexp_replace(cast(id as string), '0', 'x')")`, 
a single-`regexp_replace` whole-stage stage:
   
   | Metric | Before | After |
   |---|---|---|
   | `maxMethodCodeSize` | 551 | 435 (-21%) |
   | `maxConstantPoolSize` | 277 | 244 (-12%) |
   
   These are real (not Janino-foldable) reductions, replicated per stage class, 
with the loop body compiled once.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is a refactor; `eval` results, `nullable`, `dataType`, and the 
error message/args are unchanged, so SQL output and golden files are unaffected.
   
   One internal trade-off: interpreted `eval` now allocates a `StringBuilder` 
per row instead of reusing a cached field. This matches what the codegen path 
always did, and is negligible relative to the regex matching; it removes a 
`@transient lazy val` field.
   
   ### How was this patch tested?
   
   - `RegexpExpressionsSuite` "RegexReplace" — existing eval + codegen 
coverage, plus a new throwing-path case (a replacement referencing a 
non-existent group raises `INVALID_REGEXP_REPLACE`), which the tree previously 
did not cover; `checkErrorInExpression` exercises it in both interpreted and 
codegen modes.
   - The existing SPARK-22570 "RegExpReplace should not create a lot of global 
variables" test still passes (no new mutable state).
   - `StringFunctionsSuite` (`regexp_replace`) and `CollationSQLRegexpSuite`.
   
   Follow-up: `RegExpExtract` / `RegExpExtractAll` share a similar inline-loop 
shape and could get the same treatment in a separate sub-task.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Claude Opus 4.8)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [SPARK-57255][SQL] Simplify RegExpReplace codegen by extracting the match/replace loop into a shared helper [spark]

Reply via email to