[ 
https://issues.apache.org/jira/browse/SPARK-44132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735976#comment-17735976
 ] 

Bruce Robbins commented on SPARK-44132:
---------------------------------------

[~steven.aerts] Go for it!

> nesting full outer joins confuses code generator
> ------------------------------------------------
>
>                 Key: SPARK-44132
>                 URL: https://issues.apache.org/jira/browse/SPARK-44132
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0, 3.4.0, 3.5.0
>         Environment: We verified that this bug exists in Spark 3.3 
> through Spark 3.5.
>            Reporter: Steven Aerts
>            Priority: Major
>
> We are seeing issues with the code generator when querying Java bean encoded 
> data with two nested full outer joins.
> {code:java}
> dsA.join(dsB, seq("id"), "full_outer").join(dsC, seq("id"), "full_outer"); 
> {code}
> will generate invalid code and, depending on the data used, can produce 
> stack traces like:
> {code:java}
>  Caused by: java.lang.NullPointerException
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.wholestagecodegen_findNextJoinRows_0$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> Or:
> {code:java}
>  Caused by: java.lang.AssertionError: index (2) should < 2
>         at org.apache.spark.sql.catalyst.expressions.UnsafeRow.assertIndexIsValid(UnsafeRow.java:118)
>         at org.apache.spark.sql.catalyst.expressions.UnsafeRow.isNullAt(UnsafeRow.java:315)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.smj_consumeFullOuterJoinRow_0$(Unknown Source)
>         at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
>         at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> {code}
> When we look at the generated code we see that the code generator seems to be 
> mixing up parameters.  For example:
> {code:java}
> if (smj_leftOutputRow_0 != null) {                          //<==== null check for wrong/left parameter
>   boolean smj_isNull_12 = smj_rightOutputRow_0.isNullAt(1); //<==== causes NPE on right parameter here
> {code}
> It is as if nesting two full outer joins confuses the code generator, 
> causing it to generate invalid code.
> There is one other strange thing.  We found this issue with datasets that 
> use the Java bean encoder.  We tried to reproduce it in the Spark shell and 
> with Scala case classes, but were unable to do so. 
> We wrote a reproduction scenario as unit tests (one for each of the stack 
> traces above) against the Spark code base and made it available as a [pull 
> request|https://github.com/apache/spark/pull/41688] for this issue.


