[ https://issues.apache.org/jira/browse/SPARK-20466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16170759#comment-16170759 ]
Sahil Takiar commented on SPARK-20466:
--------------------------------------

I just hit this issue in Hive-on-Spark when running some TPC-DS queries. It seems to be intermittent; re-tries of the task succeed. I have a very similar stack trace:

{code}
java.lang.NullPointerException
	at org.apache.spark.rdd.HadoopRDD$.addLocalConfiguration(HadoopRDD.scala:364)
	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:238)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:211)
	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

The {{JobConf}} object can be {{null}} if {{HadoopRDD#getJobConf}} returns {{null}}. It looks like there is a race condition in {{#getJobConf}} [here|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L160]. The method {{HadoopRDD.containsCachedMetadata}} looks into an internal metadata cache, {{SparkEnv#hadoopJobMetadata}}. This cache uses soft references, so the JVM may reclaim entries from the map whenever there is GC pressure, in which case any get request on the reclaimed key returns {{null}}. The race condition is that {{#getJobConf}} first checks whether the cache contains the key and then retrieves it; between the {{containsKey}} and the {{get}} it is possible that the entry is reclaimed by the JVM, which would cause {{#getJobConf}} to return {{null}}. The fix should be pretty simple: don't use the {{containsKey(key)}} method on the cache, just run a {{get(key)}} and check whether the result is {{null}}. Happy to create a PR if others agree with my analysis.
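To make the proposed change concrete, here is a minimal sketch of the two lookup patterns. This is not the actual Spark code: the plain synchronized {{HashMap}}, the method names, and the {{base}} parameter are placeholders for {{SparkEnv#hadoopJobMetadata}} and the surrounding {{HadoopRDD}} logic.

{code}
import java.util.Collections
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapred.JobConf

object GetJobConfSketch {
  // Stand-in for SparkEnv.get.hadoopJobMetadata (a soft-reference-valued map in Spark).
  val cache: java.util.Map[String, AnyRef] =
    Collections.synchronizedMap(new java.util.HashMap[String, AnyRef]())

  // Racy pattern: the soft-referenced entry can be reclaimed between
  // containsKey and get, so get may return null even though containsKey was true.
  def getJobConfRacy(key: String, base: Configuration): JobConf = {
    if (cache.containsKey(key)) {
      cache.get(key).asInstanceOf[JobConf] // may be null if the entry was just reclaimed
    } else {
      val newJobConf = new JobConf(base)
      cache.put(key, newJobConf)
      newJobConf
    }
  }

  // Proposed fix: a single get followed by a null check, so a reclaimed entry
  // falls through to rebuilding the JobConf instead of returning null.
  def getJobConfFixed(key: String, base: Configuration): JobConf = {
    val cached = cache.get(key)
    if (cached != null) {
      cached.asInstanceOf[JobConf]
    } else {
      val newJobConf = new JobConf(base)
      cache.put(key, newJobConf)
      newJobConf
    }
  }
}
{code}

With the single-get pattern the rebuild path still re-creates and re-caches the {{JobConf}} when the entry has been reclaimed, so correctness no longer depends on a soft-referenced entry surviving between two cache calls.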
> HadoopRDD#addLocalConfiguration throws NPE
> ------------------------------------------
>
>                 Key: SPARK-20466
>                 URL: https://issues.apache.org/jira/browse/SPARK-20466
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.0.2
>            Reporter: liyunzhang_intel
>            Priority: Minor
>         Attachments: NPE_log
>
>
> In Spark 2.0.2, it throws an NPE:
> {code}
> 17/04/23 08:19:55 ERROR executor.Executor: Exception in task 439.0 in stage 16.0 (TID 986)
> java.lang.NullPointerException
> 	at org.apache.spark.rdd.HadoopRDD$.addLocalConfiguration(HadoopRDD.scala:373)
> 	at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:243)
> 	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
> 	at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:86)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> Suggestion: add a null check to avoid the NPE:
> {code}
> /** Add Hadoop configuration specific to a single partition and attempt. */
> def addLocalConfiguration(jobTrackerId: String, jobId: Int, splitId: Int, attemptId: Int,
>     conf: JobConf) {
>   val jobID = new JobID(jobTrackerId, jobId)
>   val taId = new TaskAttemptID(new TaskID(jobID, TaskType.MAP, splitId), attemptId)
>   if (conf != null) {
>     conf.set("mapred.tip.id", taId.getTaskID.toString)
>     conf.set("mapred.task.id", taId.toString)
>     conf.setBoolean("mapred.task.is.map", true)
>     conf.setInt("mapred.task.partition", splitId)
>     conf.set("mapred.job.id", jobID.toString)
>   }
> }
> {code}