[jira] [Commented] (SPARK-18859) Catalyst codegen does not mark column as nullable when it should. Causes NPE

Anurag Agarwal (JIRA) Tue, 21 Nov 2017 01:45:01 -0800

    [ https://issues.apache.org/jira/browse/SPARK-18859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260481#comment-16260481 ]


Anurag Agarwal commented on SPARK-18859:
----------------------------------------

Another workaround for PostgreSQL databases is to set up dblink and run the 
query through dblink, because a dblink query requires you to state the target 
schema at the end. This avoids confusing the PostgreSQL driver and helps it 
identify nullability correctly for the query.
I tried the approach suggested by [~mentegy] and it works too, but I did not 
have CREATE VIEW permission on the source database, which made me look for 
another approach.
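
A minimal sketch of how such a dblink subquery might be assembled for the JDBC 
reader's table option. The connection string `dbname=source`, the alias 
`testtable`, and the helper names `sqlLiteral` and `wrap` are assumptions made 
for illustration; the inner query and column list follow the example in the 
issue description. It also assumes the dblink extension is installed in the 
database Spark connects to.

```scala
object DblinkSubquery {
  // Quote a string as a SQL literal, doubling any embedded single quotes.
  def sqlLiteral(s: String): String = "'" + s.replace("'", "''") + "'"

  // Wrap innerQuery in a dblink call. dblink returns generic records, so the
  // column names and types (the "target schema") are declared after the call.
  def wrap(connStr: String,
           innerQuery: String,
           columns: Seq[(String, String)],
           alias: String): String = {
    val colDefs = columns.map { case (name, sqlType) => s"$name $sqlType" }.mkString(", ")
    s"(select * from dblink(${sqlLiteral(connStr)}, ${sqlLiteral(innerQuery)}) " +
      s"as t($colDefs)) as $alias"
  }

  // The left join from the issue description, routed through dblink.
  val issueExample: String = wrap(
    "dbname=source", // hypothetical connection string
    "select t.id, t.age, j.name from masterdata.testtable t " +
      "left join masterdata.jointable j on t.id = j.id",
    Seq("id" -> "int", "age" -> "int", "name" -> "text"),
    "testtable")
}
```

The resulting string would be passed to the JDBC reader in place of the plain 
subquery. Since the columns then come from a function call rather than a base 
table, the driver should have no NOT NULL constraint to report for them.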

> Catalyst codegen does not mark column as nullable when it should. Causes NPE
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-18859
>                 URL: https://issues.apache.org/jira/browse/SPARK-18859
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 2.0.2
>            Reporter: Mykhailo Osypov
>
> When joining two tables via LEFT JOIN, columns from the right table may be 
> NULL; however, Catalyst codegen does not recognize this.
> Example:
> {code:title=schema.sql|borderStyle=solid}
> create table masterdata.testtable(
>   id int not null,
>   age int
> );
> create table masterdata.jointable(
>   id int not null,
>   name text not null
> );
> {code}
> {code:title=query_to_select.sql|borderStyle=solid}
> (select t.id, t.age, j.name from masterdata.testtable t left join 
> masterdata.jointable j on t.id = j.id) as testtable;
> {code}
> {code:title=master code|borderStyle=solid}
> val df = sqlContext
>       .read
>       .format("jdbc")
>       .option("dbTable", "query to select")
>       ....
>       .load
> //df generated schema
> /*
> root
>  |-- id: integer (nullable = false)
>  |-- age: integer (nullable = true)
>  |-- name: string (nullable = false)
> */
> {code}
> {code:title=Codegen|borderStyle=solid}
> /* 038 */       scan_rowWriter.write(0, scan_value);
> /* 039 */
> /* 040 */       if (scan_isNull1) {
> /* 041 */         scan_rowWriter.setNullAt(1);
> /* 042 */       } else {
> /* 043 */         scan_rowWriter.write(1, scan_value1);
> /* 044 */       }
> /* 045 */
> /* 046 */       scan_rowWriter.write(2, scan_value2);
> {code}
> Since *j.name* comes from the right table of the *left join* query, it may be 
> null. However, the generated schema does not reflect this (probably because 
> the column is defined as *name text not null*).
> {code:title=StackTrace|borderStyle=solid}
> java.lang.NullPointerException
>       at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
>       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>       at org.apache.spark.sql.execution.debug.package$DebugExec$$anonfun$3$$anon$1.hasNext(package.scala:146)
>       at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1763)
>       at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
>       at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
>       at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>       at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1899)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>       at org.apache.spark.scheduler.Task.run(Task.scala:86)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {code}
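
The schema shown above can be read as the result of a simple mapping from the 
driver's per-column nullability flag. The sketch below is an assumption about 
how that mapping behaves, not a copy of Spark's code: a column is treated as 
non-nullable only when the JDBC driver reports `columnNoNulls`. The object and 
method names are hypothetical; the constants are the standard ones from 
`java.sql.ResultSetMetaData`.

```scala
import java.sql.ResultSetMetaData

object JdbcNullability {
  // Assumed rule: anything other than an explicit "no nulls" from the driver
  // (that is, "nullable" or "unknown") yields a nullable column.
  def treatedAsNullable(driverFlag: Int): Boolean =
    driverFlag != ResultSetMetaData.columnNoNulls
}
```

Under this reading, *j.name* ends up with nullable = false because the driver 
reports the NOT NULL constraint of the base column, even though the left join 
can produce NULLs for it. A column whose nullability the driver reports as 
unknown would be treated as nullable.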



