Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/30#issuecomment-41356120
@sryza I have tested this successfully on a standalone cluster. However, I
haven't been able to get it working on a CDH cluster. I tried building with
both Maven and SBT (the latter of which clearly doesn't work yet), but neither
was fruitful.
More specifically, I did
```
mvn -Pyarn -Dhadoop.version=2.3.0-cdh5.0.0 -Dyarn.version=2.3.0-cdh5.0.0 -DskipTests clean package
MASTER=yarn-client bin/pyspark
```
and ran into
```
14/04/25 03:16:54 INFO CoarseGrainedExecutorBackend: Got assigned task 0
14/04/25 03:16:55 INFO Executor: Running task ID 0
14/04/25 03:16:56 ERROR Executor: Exception in task ID 0
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:183)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:55)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:42)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:97)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:57)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:210)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:43)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:42)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
14/04/25 03:16:56 ERROR Executor: Uncaught exception in thread Thread[stderr reader for python,5,main]
java.lang.NullPointerException
    at org.apache.spark.api.python.PythonWorkerFactory$$anon$3$$anonfun$run$3.apply$mcV$sp(PythonWorkerFactory.scala:171)
    at org.apache.spark.api.python.PythonWorkerFactory$$anon$3$$anonfun$run$3.apply(PythonWorkerFactory.scala:169)
    at org.apache.spark.api.python.PythonWorkerFactory$$anon$3$$anonfun$run$3.apply(PythonWorkerFactory.scala:169)
```
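For context on where that `EOFException` comes from: as I understand it, `PythonWorkerFactory.startDaemon` launches the Python daemon and then blocks on `DataInputStream.readInt()` to read the port the daemon binds to from its stdout; EOF at that `readInt` means the daemon process exited before writing those 4 bytes (e.g. the wrong Python was found on the executor, or an import failed on startup). A rough illustrative sketch of that handshake in Python (the function and the stand-in port value here are hypothetical, not Spark code):

```python
import struct
import subprocess
import sys

def read_daemon_port(cmd):
    """Launch a 'daemon' and read the 4-byte big-endian port it announces,
    mirroring what the JVM side does with DataInputStream.readInt()."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    data = proc.stdout.read(4)  # readInt() consumes exactly 4 bytes
    proc.wait()
    if len(data) < 4:
        # This is the EOFException case in the trace above: the daemon
        # died before announcing its port.
        raise EOFError("daemon exited before writing its port")
    return struct.unpack(">i", data)[0]  # big-endian int, like readInt()

# A daemon that writes its port (stand-in value) before exiting succeeds:
port = read_daemon_port(
    [sys.executable, "-c",
     "import struct,sys; sys.stdout.buffer.write(struct.pack('>i', 54321))"])
```

So the EOF on the JVM side is a symptom; the real failure is whatever killed the daemon, which is why the daemon's stderr (the thread that hit the NPE) would be the useful thing to see.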
I will spend some time digging into what the NPE is, but in the meantime,
do you see anything obvious that I'm missing?