[ https://issues.apache.org/jira/browse/SPARK-40199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634084#comment-17634084 ]

Apache Spark commented on SPARK-40199:
--------------------------------------

User 'xkrogen' has created a pull request for this issue:
https://github.com/apache/spark/pull/38660

> Spark throws NPE without useful message when NULL value appears in non-null schema
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-40199
>                 URL: https://issues.apache.org/jira/browse/SPARK-40199
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.2
>            Reporter: Erik Krogen
>            Priority: Major
>
> Currently, in some cases, if Spark encounters a NULL value where the schema
> indicates that the column/field should be non-null, it throws a
> {{NullPointerException}} with no message, leaving no way to debug further.
> This can happen, for example, if a UDF is erroneously marked non-nullable via
> {{asNonNullable()}}, or if the input data doesn't actually match the declared
> schema (which can happen e.g. with Avro when the reader provides a schema
> declaring a field non-null even though the data was written with null values;
> a sketch of that scenario follows the stack trace below).
> As an example of how to reproduce the UDF case:
> {code:scala}
> // Assumes a spark-shell / SparkSession named `spark`; the implicits provide toDF and $.
> import spark.implicits._
> val badUDF = spark.udf.register[String, Int]("bad_udf", in => null).asNonNullable()
> Seq(1, 2).toDF("c1").select(badUDF($"c1")).collect()
> {code}
> This throws an exception like:
> {code}
> Driver stacktrace:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1) (xxxxxxxxxx executor driver): java.lang.NullPointerException
>       at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
>       at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
>       at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
>       at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
>       at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>       at org.apache.spark.scheduler.Task.run(Task.scala:139)
>       at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
>       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1490)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {code}
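> The Avro flavor mentioned above can be set up along the following lines. This is only a
> rough sketch: the path, record name, and reader schema are illustrative, it assumes the
> spark-avro package is on the classpath, and whether this exact snippet surfaces the same
> bare NPE depends on the spark-avro version and how the reader/writer schemas resolve.
> {code:scala}
> // Write data where c1 is nullable and actually contains a null.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> val nullableSchema = StructType(Seq(StructField("c1", StringType, nullable = true)))
> val df = spark.createDataFrame(
>   spark.sparkContext.parallelize(Seq(Row("a"), Row(null))), nullableSchema)
> df.write.format("avro").mode("overwrite").save("/tmp/spark-40199-demo")
>
> // Read it back with a reader schema that declares c1 non-null (no union with "null"),
> // even though the files contain null values for it.
> val nonNullReaderSchema =
>   """{"type":"record","name":"topLevelRecord","fields":[{"name":"c1","type":"string"}]}"""
> spark.read.format("avro")
>   .option("avroSchema", nonNullReaderSchema)
>   .load("/tmp/spark-40199-demo")
>   .collect()
> {code}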
> To a user, such a bare {{NullPointerException}} is very confusing: it looks like a bug
> in Spark itself. We have had many users report such failures, and although we can point
> them to a schema-data mismatch, the exception gives no indication of which field contains
> the offending values, so finding and fixing the problem requires a laborious data
> exploration process.
> We should provide a better error message in such cases.
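> To make that concrete, the kind of failure we'd like to see is sketched below. This is
> purely illustrative of the desired behavior; the helper name and exception type are made
> up for the sketch and it is not the actual Spark change.
> {code:scala}
> // Illustrative only: instead of a bare NPE deep in UnsafeWriter, fail with the
> // field's name and ordinal so the user knows where the unexpected null came from.
> import org.apache.spark.sql.Row
>
> def assertFieldNotNull(row: Row, ordinal: Int, fieldName: String): Unit = {
>   if (row.isNullAt(ordinal)) {
>     throw new IllegalStateException(
>       s"Field '$fieldName' (ordinal $ordinal) is declared non-nullable but contains NULL; " +
>         "check the UDF's nullability (asNonNullable) or whether the input data matches the schema.")
>   }
> }
> {code}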



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
