bersprockets opened a new pull request, #40766:
URL: https://github.com/apache/spark/pull/40766

   ### What changes were proposed in this pull request?
   
   In `JoinCodegenSupport#getJoinCondition`, evaluate any referenced 
stream-side variables before using them in the generated code.
   
   This patch doesn't evaluate the passed stream-side variables directly, but 
instead evaluates a copy (`streamVars2`). This is because 
`SortMergeJoin#codegenFullOuter` needs to evaluate the stream-side vars 
within a different scope than the condition check, so we mustn't delete the 
initialization code from the original `ExprCode` instances.
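
The copy-then-evaluate idea can be sketched with a toy model. This is only an illustration in Python, not Spark's Scala code: `ToyExprCode`, `evaluate_variables`, and the `smj_value_7` definition string are hypothetical stand-ins, and the assumption is that evaluating a variable emits its definition once and then clears it.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToyExprCode:
    """Hypothetical stand-in for a codegen variable."""
    code: str   # pending definition/initialization code ("" once emitted)
    value: str  # name of the generated local variable


def evaluate_variables(variables):
    """Return the pending definitions and clear them so they are emitted once."""
    pending = [v.code for v in variables if v.code]
    for v in variables:
        v.code = ""
    return "\n".join(pending)


# Made-up definition code, for illustration only.
stream_vars = [ToyExprCode("int smj_value_7 = row.getInt(1);", "smj_value_7")]

# Evaluate shallow copies for the condition check; the originals are untouched.
stream_vars2 = [copy.copy(v) for v in stream_vars]
condition_prelude = evaluate_variables(stream_vars2)

# The originals still carry their definition code for the other scope.
outer_scope_code = evaluate_variables(stream_vars)
```

In this sketch both scopes receive the definition exactly once, which would not be the case if the originals had been evaluated (and cleared) for the condition check.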
   
   ### Why are the changes needed?
   
   When a bound condition of a full outer join references the same stream-side 
column more than once, wholestage codegen generates code that fails to compile.
   
   For example, the following query fails with a compilation error:
   
   ```
   create or replace temp view v1 as
   select * from values
   (1, 1),
   (2, 2),
   (3, 1)
   as v1(key, value);
   
   create or replace temp view v2 as
   select * from values
   (1, 22, 22),
   (3, -1, -1),
   (7, null, null)
   as v2(a, b, c);
   
   select *
   from v1
   full outer join v2
   on key = a
   and value > b
   and value > c;
   ```
   The error is:
   ```
   org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
277, Column 9: Redefinition of local variable "smj_isNull_7"
   ```
   The same error occurs with code generated from `ShuffledHashJoinExec`:
   ```
   select /*+ SHUFFLE_HASH(v2) */ *
   from v1
   full outer join v2
   on key = a
   and value > b
   and value > c;
   ```
   In this case, the error is:
   ```
   org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
174, Column 5: Redefinition of local variable "shj_value_1" 
   ```
   Neither `SortMergeJoin#codegenFullOuter` nor 
`ShuffledHashJoinExec#doProduce` evaluates the stream-side variables before 
calling `consumeFullOuterJoinRow#getJoinCondition`. As a result, 
`getJoinCondition` generates definition/initialization code for each referenced 
stream-side variable at the point of use. If a stream-side variable is used 
more than once in the bound condition, the definition/initialization code is 
generated more than once, resulting in the "Redefinition of local variable" 
error.
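
A toy reproduction of that failure mode, as a hedged sketch rather than the actual generated code: the `Var` class, the two generator functions, and the `streamRow.getInt(1)` definition are hypothetical; only the variable name `shj_value_1` is taken from the error message above.

```python
import re
from dataclasses import dataclass


@dataclass
class Var:
    """Toy codegen variable: `code` defines it, `value` is its Java name."""
    code: str
    value: str


def condition_at_point_of_use(var, others):
    """Buggy shape: emit the definition every time the variable is referenced."""
    lines = []
    for other in others:
        lines.append(var.code)
        lines.append(f"if (!({var.value} > {other})) matched = false;")
    return "\n".join(lines)


def condition_pre_evaluated(var, others):
    """Fixed shape: emit the definition once, then reference it by name."""
    lines = [var.code]
    lines += [f"if (!({var.value} > {other})) matched = false;" for other in others]
    return "\n".join(lines)


def declarations(java_src, name):
    """Count how many times `name` is declared as a local int."""
    return len(re.findall(rf"\bint\s+{re.escape(name)}\s*=", java_src))


# Hypothetical definition code for the stream-side column `value`.
value = Var("int shj_value_1 = streamRow.getInt(1);", "shj_value_1")
buggy = condition_at_point_of_use(value, ["b", "c"])
fixed = condition_pre_evaluated(value, ["b", "c"])
print(declarations(buggy, "shj_value_1"), declarations(fixed, "shj_value_1"))
```

The point-of-use variant declares `shj_value_1` twice in the same scope, which a Java compiler rejects; the pre-evaluated variant declares it once.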
   
   The query itself still succeeds, since Spark disables wholestage codegen 
after the compilation failure and tries again.
   
   (In the case of other join-type/strategy pairs, either the implementations 
don't call `JoinCodegenSupport#getJoinCondition`, or the stream-side variables 
are pre-evaluated before the call is made, so no error happens in those cases.)
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit tests.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

