Santosh Balasubramanya created SPARK-16682:
----------------------------------------------
Summary: pyspark 1.6.0 not handling multiple level import when the
necessary files are zipped
Key: SPARK-16682
URL: https://issues.apache.org/jira/browse/SPARK-16682
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.6.0
Environment: Spark StandAlone
RedHat Linux
Mac
Reporter: Santosh Balasubramanya
In Spark Standalone mode (1.6.0), both batch and streaming jobs fail to pick up
dependencies packaged in zip format and added via "addPyFile".
The dependency Python files are modularized and placed in a hierarchical folder
structure.
from workflow import di
from workflow import cache
Imports of this kind fail. I also tried including the imports inside each of
the functions called from map and foreach, as well as the workaround suggested
in the link below, without success.
(http://stackoverflow.com/questions/27644525/pyspark-py-files-doesnt-work)
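A minimal sketch (no Spark required) of what the worker side relies on: addPyFile places the zip on the executor's sys.path, so Python's zipimport must be able to resolve the nested package. The "workflow" package layout and module contents below are hypothetical stand-ins for the reporter's code, just to show that a zip with proper __init__.py files imports cleanly in plain Python:

```python
import os
import sys
import tempfile
import zipfile

# Build a hypothetical "workflow" package inside a zip, mirroring the
# hierarchical layout described above.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    # __init__.py is required for "workflow" to be importable as a package
    zf.writestr("workflow/__init__.py", "")
    zf.writestr("workflow/di.py", "VALUE = 'di'\n")
    zf.writestr("workflow/cache.py", "VALUE = 'cache'\n")

# This is what addPyFile effectively does on each worker: prepend the
# zip to sys.path so zipimport can serve the modules.
sys.path.insert(0, zip_path)

from workflow import di, cache
print(di.VALUE, cache.VALUE)
```

If this succeeds locally but the same zip fails on the executors, the problem is on the Spark side (distribution of the zip or the worker's path setup), not in the package layout itself.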
Detailed error code below
Job aborted due to stage failure: Task 1 in stage 6718.0 failed 4 times, most
recent failure: Lost task 1.3 in stage 6718.0 (TID 7287, 10.131.66.63):
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/forty2/analytics/eltkhome/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/home/forty2/analytics/eltkhome/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/home/forty2/analytics/eltkhome/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
File "/home/forty2/analytics/eltkhome/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 653, in subimport
__import__(name)
ImportError: ('No module named workflow.datainterface', <function subimport at
0x925c08>, ('workflow.datainterface',))
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]