[ https://issues.apache.org/jira/browse/SPARK-2292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046832#comment-14046832 ]
Bharath Ravi Kumar edited comment on SPARK-2292 at 6/28/14 12:10 PM:
---------------------------------------------------------------------
I've identified the likely trigger for this behavior (though that still doesn't
explain the root cause). The following build + launch combination is
problematic:
1) Spark dependencies *not marked* as provided in the Maven pom (and hence
bundled into the uber jar).
2) The master URL and the uber jar location passed to the SparkContext
initialized within the application.
3) The application run as "mvn exec:java ..." (note that this is expected to
preserve the classpath, while the uber jar location is separately passed to the
SparkContext in the code).
With this combination, the rest of the application worked just fine, but
mapToPair failed with an NPE.
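For concreteness, here's a minimal sketch of the problematic variant (the class name, jar path, master URL, and input file below are hypothetical placeholders, not the actual code from the gist):
{noformat}
import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

public class ReproApp {
    public static void main(String[] args) {
        // Problematic combination: deployment details hard-coded in the app,
        // Spark bundled into the uber jar, launched via "mvn exec:java".
        SparkConf conf = new SparkConf()
                .setAppName("npe-repro")
                .setMaster("spark://localhost:7077")
                .setJars(new String[] { "target/repro-app.jar" });
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<String> lines = sc.textFile("data.txt");
        // The NPE surfaces on the executors when this stage runs.
        JavaPairRDD<String, Integer> pairs = lines.mapToPair(
                new PairFunction<String, String, Integer>() {
                    @Override
                    public Tuple2<String, Integer> call(String s) {
                        return new Tuple2<String, Integer>(s, 1);
                    }
                });
        pairs.count();
        sc.stop();
    }
}
{noformat}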
The "right" usage turned out to be what is recommended in the docs:
1) Build the project with spark dependencies marked as provided
2) Don't specify the jar location & other deployment opts inside the app
3) Submit the application through spark-submit.
This time, the entire execution goes through flawlessly.
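For comparison, the only app-side change in the working variant is dropping the deployment settings; everything else (names still hypothetical) moves to the spark-submit command line:
{noformat}
// Launched as (with spark-core scoped as provided in the pom):
//   bin/spark-submit --class ReproApp --master spark://localhost:7077 target/repro-app.jar
SparkConf conf = new SparkConf().setAppName("npe-repro");  // no setMaster()/setJars()
JavaSparkContext sc = new JavaSparkContext(conf);
// ... same mapToPair pipeline as in the sketch above ...
{noformat}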
Since the former approach appears theoretically equivalent to the latter, the
reason for the NPE in the former case still merits further investigation, IMO.
Meanwhile, a documentation update is needed to make it explicit that apps must
be launched only through spark-submit.
I've simplified and updated the earlier code sample and Maven pom. I'd suggest
creating a new standalone Maven app to reproduce the issue.
> NullPointerException in JavaPairRDD.mapToPair
> ---------------------------------------------
>
> Key: SPARK-2292
> URL: https://issues.apache.org/jira/browse/SPARK-2292
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.0.0
> Environment: Spark 1.0.0, Standalone with the master & single slave
> running on Ubuntu on a laptop. 4G mem and 8 cores were available to the
> executor.
> Reporter: Bharath Ravi Kumar
> Priority: Critical
> Attachments: SPARK-2292-aash-repro.tar.gz
>
>
> Correction: Invoking JavaPairRDD.mapToPair results in an NPE:
> {noformat}
> 14/06/26 21:05:35 WARN scheduler.TaskSetManager: Loss was due to java.lang.NullPointerException
> java.lang.NullPointerException
> at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:750)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
> at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
> at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {noformat}
> This occurs only after migrating to the 1.0.0 API. The details of the code
> and the data file used to test are included in this gist:
> https://gist.github.com/reachbach/d8977c8eb5f71f889301