GitHub user mn-mikke opened a pull request:

    https://github.com/apache/spark/pull/21687

    [SPARK-24165][SQL] Fixing the output data type of CaseWhen expression

    ## What changes were proposed in this pull request?
    This PR proposes a fix for the output data type of the ```CaseWhen```
    expression. The current implementation ignores the nullability of nested
    types from different execution branches and returns the type of the first branch.
    
    This can cause an unexpected ```NullPointerException``` in other
    expressions that depend on a ```CaseWhen``` expression.
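    
    The idea of taking nested nullability from every branch, rather than only the first, can be sketched as follows. This is a minimal illustration on simplified, hypothetical types (`DType`, `Field`, `StructT`, `BranchTypes`); none of these names are Spark classes, and the sketch is not the code changed by this PR.
    
    ```scala
    // Hypothetical, simplified model of data types (NOT Spark's own classes).
    sealed trait DType
    case object StringT extends DType
    case object IntT extends DType
    final case class Field(name: String, dataType: DType, nullable: Boolean)
    final case class StructT(fields: Seq[Field]) extends DType
    
    object BranchTypes {
      // Merge two structurally equal types. A nested field becomes nullable
      // if it is nullable in either of the two inputs.
      def mergeNullability(a: DType, b: DType): DType = (a, b) match {
        case (StructT(fa), StructT(fb)) if fa.map(_.name) == fb.map(_.name) =>
          StructT(fa.zip(fb).map { case (x, y) =>
            Field(x.name, mergeNullability(x.dataType, y.dataType), x.nullable || y.nullable)
          })
        case (x, y) if x == y => x
        case _ =>
          throw new IllegalArgumentException(s"Incompatible branch types: $a vs $b")
      }
    
      // Output type of a conditional: combine all branch types instead of
      // returning the first one. Assumes at least one branch.
      def outputType(branches: Seq[DType]): DType = branches.reduce(mergeNullability)
    }
    
    object Demo {
      def main(args: Array[String]): Unit = {
        // The "when" branch has a non-nullable val1; the "otherwise" branch allows null.
        val whenBranch = StructT(Seq(
          Field("val1", StringT, nullable = false),
          Field("val2", IntT, nullable = false)))
        val elseBranch = StructT(Seq(
          Field("val1", StringT, nullable = true),
          Field("val2", IntT, nullable = false)))
    
        println(BranchTypes.outputType(Seq(whenBranch, elseBranch)))
      }
    }
    ```
    
    In this sketch the merged `val1` field is nullable because one branch allows nulls, while `val2` stays non-nullable. Returning only `whenBranch` would report `val1` as non-nullable, which is the mismatch described above.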
    
    Example:
    ```
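    import java.util
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{lit, struct, when}
    import org.apache.spark.sql.types._
    // Assumes an active SparkSession named `spark`.
    import spark.implicits._
    
    val rows = new util.ArrayList[Row]()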
    rows.add(Row(true, ("a", 1)))
    rows.add(Row(false, (null, 2)))
    val schema = StructType(Seq(
      StructField("cond", BooleanType, false),
      StructField("s", StructType(Seq(
        StructField("val1", StringType, true),
        StructField("val2", IntegerType, false)
      )), false)
    ))
    
    val df = spark.createDataFrame(rows, schema)
    
    df
      .select(when('cond, struct(lit("x").as("val1"), lit(10).as("val2"))).otherwise('s) as "res")
      .select('res.getField("val1"))
      .show()
    ```
    Exception:
    ```
    Exception in thread "main" java.lang.NullPointerException
        at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:109)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
        at org.apache.spark.sql.execution.LocalTableScanExec$$anonfun$unsafeRows$1.apply(LocalTableScanExec.scala:44)
    ...
    ```
    Output schema:
    ```
    root
     |-- res.val1: string (nullable = false)
    ```
    
    ## How was this patch tested?
    New test cases were added to:
    - DataFrameSuite.scala
    - conditionalExpressions.scala


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mn-mikke/spark SPARK-24165

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21687.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21687
    
----
commit 71040635723a4dc3bc55b4415261d5a7abf4ed50
Author: Marek Novotny <mn.mikke@...>
Date:   2018-07-01T13:36:24Z

    [SPARK-24165][SQL] Fixing the output data type of CaseWhen expression when resolving nullability of nested types

----

