Paul Finkelshteyn created SPARK-31854:
-----------------------------------------

             Summary: Different results of query execution with wholestage 
codegen on and off
                 Key: SPARK-31854
                 URL: https://issues.apache.org/jira/browse/SPARK-31854
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Paul Finkelshteyn


Preface: I'm creating a Kotlin API for Spark that takes the best parts of three 
worlds: Spark Scala, Spark Java and Kotlin.

The nice part is that it works in most scenarios.

But I've hit the following corner case:
{code:scala}
withSpark(props = mapOf("spark.sql.codegen.wholeStage" to true)) {
    dsOf(1, null, 2)
            .map { c(it) }
            .debugCodegen()
            .show()
}
{code}

Here c(it) creates an unnamed tuple.

It fails with the following exception:
{code}
java.lang.NullPointerException: Null value appeared in non-nullable field:
top level Product or row object
If the schema is inferred from a Scala tuple/case class, or a Java bean, please 
try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer 
instead of int/scala.Int).
{code}

I know this won't work in Scala, so I could stop here. But it works in Kotlin if 
I turn wholestage codegen off!

Moreover, if we dig into the generated code (with wholestage codegen on), we 
see that if any element of the source dataset is null, an NPE is thrown no 
matter what.

The flow is as follows, traced from the serializer back to the table scan:

{code}
private void serializefromobject_doConsume_0(org.jetbrains.spark.api.Arity1 serializefromobject_expr_0_0, boolean serializefromobject_exprIsNull_0_0) throws java.io.IOException {
    serializefromobject_doConsume_0(mapelements_value_1, mapelements_isNull_1);
        mapelements_isNull_1 = mapelements_resultIsNull_0;
            mapelements_resultIsNull_0 = mapelements_exprIsNull_0_0;
                private void mapelements_doConsume_0(java.lang.Integer mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws java.io.IOException {
                    mapelements_doConsume_0(deserializetoobject_value_0, deserializetoobject_isNull_0);
                        deserializetoobject_resultIsNull_0 = deserializetoobject_exprIsNull_0_0;
                            private void deserializetoobject_doConsume_0(InternalRow localtablescan_row_0, int deserializetoobject_expr_0_0, boolean deserializetoobject_exprIsNull_0_0) throws java.io.IOException {
                                deserializetoobject_doConsume_0(localtablescan_row_0, localtablescan_value_0, localtablescan_isNull_0);
                                    boolean localtablescan_isNull_0 = localtablescan_row_0.isNullAt(0);
        mapelements_isNull_1 = true;
{code}
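
To make the chain above easier to follow, here is a minimal plain-Java sketch of how the null flag appears to travel from the scan to the serializer. It does not use Spark at all. The class NullFlagSketch, its methods and the wrap helper (a stand-in for c(it)) are hypothetical names, and both the "mapper only runs when the flag is clear" step and the serializer check are assumptions modeled on the trace and the exception message, not the actual generated code.

```java
import java.util.function.Function;

class NullFlagSketch {
    // Counts mapper invocations so the effect of the null flag is observable.
    static int mapperCalls = 0;

    // Hypothetical stand-in for c(it): wraps a possibly-null value in a non-null holder.
    static Object[] wrap(Integer value) {
        mapperCalls++;
        return new Object[] { value };
    }

    // Models localtablescan: derive the null flag from the input and pass it on.
    static Object[] scan(Integer input, Function<Integer, Object[]> mapper) {
        boolean scanIsNull = (input == null);
        return deserialize(input, scanIsNull, mapper);
    }

    // Models deserializetoobject: the flag is forwarded unchanged.
    static Object[] deserialize(Integer value, boolean exprIsNull,
                                Function<Integer, Object[]> mapper) {
        boolean resultIsNull = exprIsNull;
        return mapElements(value, resultIsNull, mapper);
    }

    // Models mapelements (assumption): the result starts out null and the
    // mapper is only invoked when the incoming flag is clear.
    static Object[] mapElements(Integer value, boolean exprIsNull,
                                Function<Integer, Object[]> mapper) {
        boolean resultIsNull = exprIsNull;
        boolean isNull = true;
        Object[] result = null;
        if (!resultIsNull) {
            result = mapper.apply(value);
            isNull = (result == null);
        }
        return serialize(result, isNull);
    }

    // Models serializefromobject (assumption): a null top-level object is rejected.
    static Object[] serialize(Object[] obj, boolean exprIsNull) {
        if (exprIsNull) {
            throw new NullPointerException("Null value appeared in non-nullable field");
        }
        return obj;
    }

    public static void main(String[] args) {
        Integer[] inputs = { 1, null, 2 };
        for (Integer in : inputs) {
            try {
                Object[] out = scan(in, NullFlagSketch::wrap);
                System.out.println(in + " -> wrapped " + out[0]);
            } catch (NullPointerException e) {
                System.out.println(in + " -> NPE: " + e.getMessage());
            }
        }
        // For contrast: calling the mapper directly on null yields a non-null holder.
        System.out.println("direct wrap(null) is non-null: " + (wrap(null) != null));
    }
}
```

In this model the null element reaches the serializer check without the mapper ever being called, even though the mapper itself handles null and returns a non-null holder when called directly.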

You can find the generated code in its original form, along with a slightly 
simplified and refactored version, 
[here|https://gist.github.com/asm0dey/5c0fa4c985ab999b383d16257b515100].

I believe Spark should not behave differently depending on whether wholestage 
codegen is on or off, so this difference in behavior looks like a bug.

My Spark version is 3.0.0-preview2


