Santosh Balasubramanya created SPARK-16682:
----------------------------------------------

             Summary: PySpark 1.6.0 does not handle multi-level imports when the 
dependency files are zipped
                 Key: SPARK-16682
                 URL: https://issues.apache.org/jira/browse/SPARK-16682
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 1.6.0
         Environment: Spark StandAlone
RedHat Linux
Mac
            Reporter: Santosh Balasubramanya


In Spark Standalone mode (1.6.0), both batch and streaming jobs fail to pick up 
dependencies that are packaged in zip format and added with "addPyFile". The 
dependency Python files are modularized and placed in a hierarchical folder 
structure.

from workflow import di
from workflow import cache

Imports of the kind shown above fail. I also tried placing the imports inside 
each of the functions called from map and foreach, and tried the option given 
in the link below 
(http://stackoverflow.com/questions/27644525/pyspark-py-files-doesnt-work), 
without success.
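For reference, a minimal sketch of the zip layout that zip-based imports require (the module name workflow/di.py and the VALUE constant are placeholders, not the reporter's actual code): in plain Python, a package imports from a zip only when the package directory sits at the zip root and every level contains an __init__.py — the same layout addPyFile distributes to the workers.

```python
import os
import sys
import tempfile
import zipfile

# Build a zip whose root directly contains the package directory.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    # Every package level needs an __init__.py, otherwise
    # "from workflow import di" raises ImportError.
    zf.writestr("workflow/__init__.py", "")
    zf.writestr("workflow/di.py", "VALUE = 42\n")

# Adding the zip to sys.path mimics what addPyFile does on each worker.
sys.path.insert(0, zip_path)
from workflow import di

print(di.VALUE)  # the placeholder constant defined above
```

If this layout imports cleanly outside Spark but the same zip still fails on the executors, that points at the distribution/unpickling path rather than the zip itself.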

The detailed error trace is below.


Job aborted due to stage failure: Task 1 in stage 6718.0 failed 4 times, most 
recent failure: Lost task 1.3 in stage 6718.0 (TID 7287, 10.131.66.63): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File 
"/home/forty2/analytics/eltkhome/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",
 line 98, in main
    command = pickleSer._read_with_length(infile)
  File 
"/home/forty2/analytics/eltkhome/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
 line 164, in _read_with_length
    return self.loads(obj)
  File 
"/home/forty2/analytics/eltkhome/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
 line 422, in loads
    return pickle.loads(obj)
  File 
"/home/forty2/analytics/eltkhome/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py",
 line 653, in subimport
    __import__(name)
ImportError: ('No module named workflow.datainterface', <function subimport at 
0x925c08>, ('workflow.datainterface',))

        at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
