LuciferYang opened a new pull request, #56315: URL: https://github.com/apache/spark/pull/56315
### What changes were proposed in this pull request? This is a sub-task of [SPARK-56908](https://issues.apache.org/jira/browse/SPARK-56908) (reduce generated Java size in whole-stage codegen). `RegExpReplace` inlined the same match/replace loop in both `nullSafeEval` and `doGenCode` — the `matcher.region(...)`, the `while (find) { appendReplacement }` loop with a `try/catch` that builds `invalidRegexpReplaceError(...)`, and `appendTail`. So every generated whole-stage class that uses `regexp_replace` carried that ~20-line block plus the error-construction constant-pool entries. This PR moves the loop into a shared helper: ```scala object RegExpUtils { def replace(matcher: Matcher, subject: UTF8String, replacement: String, regexp: UTF8String, rep: UTF8String, pos: Int): UTF8String = { ... } } ``` Both `eval` and codegen now call it; `doGenCode` emits a single `RegExpUtils.replace(...)` call (via the Scala-object static forwarder, the same mechanism as the generated `QueryExecutionErrors.xxx` calls). This mirrors the existing `RegExpUtils.getPatternAndLastRegex` / `initLastMatcherCode` helpers in the same object. The eval-only cached `result` `StringBuilder` field is dropped — the helper allocates its own, and the codegen path always allocated one per call anyway. The Pattern/replacement caching (the per-instance mutable state) stays at the call sites; only the loop body moves. ### Why are the changes needed? To reduce the size of the generated Java in whole-stage codegen, as tracked by SPARK-56908. Moving the loop + error construction into one compiled method removes it from every generated class. Measured with `debugCodegen()` on `spark.range(1000).selectExpr("regexp_replace(cast(id as string), '0', 'x')")`, a single-`regexp_replace` whole-stage stage: | Metric | Before | After | |---|---|---| | `maxMethodCodeSize` | 551 | 435 (-21%) | | `maxConstantPoolSize` | 277 | 244 (-12%) | These are real (not Janino-foldable) reductions, replicated per stage class, with the loop body compiled once. ### Does this PR introduce _any_ user-facing change? No. This is a refactor; `eval` results, `nullable`, `dataType`, and the error message/args are unchanged, so SQL output and golden files are unaffected. One internal trade-off: interpreted `eval` now allocates a `StringBuilder` per row instead of reusing a cached field. This matches what the codegen path always did, and is negligible relative to the regex matching; it removes a `@transient lazy val` field. ### How was this patch tested? - `RegexpExpressionsSuite` "RegexReplace" — existing eval + codegen coverage, plus a new throwing-path case (a replacement referencing a non-existent group raises `INVALID_REGEXP_REPLACE`), which the tree previously did not cover; `checkErrorInExpression` exercises it in both interpreted and codegen modes. - The existing SPARK-22570 "RegExpReplace should not create a lot of global variables" test still passes (no new mutable state). - `StringFunctionsSuite` (`regexp_replace`) and `CollationSQLRegexpSuite`. Follow-up: `RegExpExtract` / `RegExpExtractAll` share a similar inline-loop shape and could get the same treatment in a separate sub-task. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Claude Opus 4.8) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
