LuciferYang opened a new pull request, #56292:
URL: https://github.com/apache/spark/pull/56292

   ### What changes were proposed in this pull request?
   
   This is a sub-task of 
[SPARK-56908](https://issues.apache.org/jira/browse/SPARK-56908) (reduce 
generated Java size in whole-stage codegen).
   
   `StringLocate.doGenCode` always emitted a null check for each of its three 
children (`substr`, `str`, `start`), plus a `boolean isNull = false` 
declaration, regardless of whether a child can actually be null. For a 
non-nullable child the codegen `isNull` is statically `false`, so these checks 
are dead code. In particular the 2-arg `locate(substr, str)` / `position(substr 
IN str)` form defaults `start` to a non-null literal, so the `if 
(!start.isNull)` guard was an always-true dead branch on every call.
   
   This PR emits each null check only when the corresponding child is nullable, 
and when the whole expression is non-nullable (i.e. both `substr` and `str` are 
non-nullable) it drops the `isNull` declaration and returns `isNull = 
FalseLiteral` so parent expressions can skip their own null check too.
   
   The behavior is unchanged. `eval` and the existing nesting are preserved: a 
null `start` returns `0` without evaluating `substr`/`str` (to conform with 
Hive), and the result is null iff a non-null `start` meets a null `substr` or 
`str`. When all three children are nullable the generated code is identical to 
before.
   
   For example, `locate(a, b)` on non-nullable columns goes from:
   
   ```java
   int value = 0;
   boolean isNull = false;
   /* a.code */
   if (!aIsNull) {
     /* b.code */
     if (!bIsNull) {
       /* start.code */    // start = Literal(1), always non-null
       if (!startIsNull) {
         if (start > 0) { value = ...exec(...) + 1; }
       } else { isNull = true; }
     } else { isNull = true; }
   }
   ```
   
   to:
   
   ```java
   int value = 0;
   /* start.code */
   /* a.code */
   /* b.code */
   if (start > 0) { value = ...exec(...) + 1; }
   ```
   
   This follows the same pattern as the merged sibling sub-tasks (e.g. `If`, 
`NaNvl`, `AtLeastNNonNulls`, `CreateNamedStruct`).
   
   ### Why are the changes needed?
   
   To reduce the size of the generated Java in whole-stage codegen, as tracked 
by SPARK-56908. The skipped guards are statically dead, and `locate`/`position` 
are common functions, so eliminating the always-true `start` guard and the dead 
null checks shrinks generated code on the hot path with no behavior change.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No. This is a codegen-only change; `eval`, `nullable`, `dataType`, and 
results are unchanged, so SQL output and golden files are unaffected.
   
   ### How was this patch tested?
   
   Existing `StringExpressionsSuite` "LOCATE" coverage plus new 
`checkEvaluation` cases covering every combination of nullable/non-nullable 
`substr`/`str`/`start` (so each generated codegen template is exercised; 
`checkEvaluation` runs interpreted and codegen, with whole-stage codegen both 
on and off). Also ran `org.apache.spark.sql.StringFunctionsSuite` 
(`locate`/`position`).
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   Generated-by: Claude Code (Claude Opus 4.8)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to