GitHub user felixcheung opened a pull request:

    https://github.com/apache/incubator-zeppelin/pull/66

    [ZEPPELIN-75]: PySpark interpreter - useful debugging traceback information is lost for any error from Spark

    When there is an error from Spark, the original error is not returned as output in the cell; instead, a generic Py4JError is shown.
    
    While it is possible to look at zeppelin-interpreter-spark-root-node.log, that log may not be accessible in a multi-user environment, since reading it requires remote access to the host running Zeppelin.
    
    Before:
    
    ```
    (<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o45.collect.\n', JavaObject id=o46), <traceback object at 0x7fb737ea72d8>)
    ```
    
    Almost all errors from Spark look like this, surfaced only as a Py4JJavaError.
    
    After:
    
    ```
    Py4JJavaError: An error occurred while calling o45.collect.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/opt/spark-1.3.0-bin-hadoop2.4/python/pyspark/worker.py", line 101, in main
        process()
      File "/opt/spark-1.3.0-bin-hadoop2.4/python/pyspark/worker.py", line 96, in process
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/opt/spark-1.3.0-bin-hadoop2.4/python/pyspark/serializers.py", line 236, in dump_stream
        vs = list(itertools.islice(iterator, batch))
      File "/opt/spark-1.3.0-bin-hadoop2.4/python/pyspark/rdd.py", line 735, in func
        initial = next(iterator)
      File "<string>", line 2, in sample
    TypeError: 'module' object is not callable

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

    (<class 'py4j.protocol.Py4JJavaError'>, Py4JJavaError(u'An error occurred while calling o45.collect.\n', JavaObject id=o46), <traceback object at 0x7f8b45deb3f8>)
    ```
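
    A minimal sketch of the idea, not the exact patch: in the interpreter's Python driver (zeppelin_pyspark.py), catch Py4JJavaError separately and print its full text, which already carries the Java-side message and Spark stack trace. The function and argument names below are illustrative, not the actual zeppelin_pyspark.py API.

    ```python
    # Hypothetical sketch; run_paragraph and scope are illustrative names.
    import sys
    import traceback

    from py4j.protocol import Py4JJavaError

    def run_paragraph(code, scope):
        try:
            exec(code, scope)
        except Py4JJavaError as e:
            # str(e) includes the Java-side message and Spark stack trace,
            # similar to the "After" output above.
            print(str(e))
            print(sys.exc_info())
        except Exception:
            # Plain Python errors: fall back to the standard formatted traceback.
            print(traceback.format_exc())
    ```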

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/felixcheung/incubator-zeppelin master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-zeppelin/pull/66.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #66
    
----
commit 7a30a14fb2b3c0e41017d909ba5c1e1a0b9c545b
Author: Felix Cheung <[email protected]>
Date:   2015-03-25T18:15:53Z

    minor doc update for running on YARN

commit 65ba046bc87cf3146ae0f80a336fb6c05d4b6619
Author: Felix Cheung <[email protected]>
Date:   2015-03-28T16:56:10Z

    Merge commit 'a007a9b5f235ebd9c608a005c5243503291d94d5'

commit e89ba083c21847a0b99c07735f106561ceee122b
Author: Felix Cheung <[email protected]>
Date:   2015-03-31T20:24:16Z

    PySpark Interpreter should allow starting with a specific version of Python, as PySpark does.

commit dfbb458abf6aa1c61b26bd51d40083a4d9664b53
Author: Felix Cheung <[email protected]>
Date:   2015-05-08T17:19:13Z

    Merge commit 'e23f3034053fbd8b8f4eff478c372d151a42c36b'

commit a55666d20e1d01cab8eeb5d0ba85f9255cb69f2c
Author: Felix Cheung <[email protected]>
Date:   2015-05-10T04:52:42Z

    PySpark error handling improvement - return more meaningful message from the original error which is useful for Spark related errors

commit c4754974f0b48a098c10b4ef2094152529eb097d
Author: Felix Cheung <[email protected]>
Date:   2015-05-10T05:01:01Z

    Merge commit '956e3f74a1b2f28fd8caa25055e77f687ca8d211'
    
    Conflicts:
        spark/src/main/resources/python/zeppelin_pyspark.py

----

