[ https://issues.apache.org/jira/browse/SPARK-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144974#comment-15144974 ]
Sean Owen commented on SPARK-13303:
-----------------------------------
Agree. I think this is one of those big "known issues": PySpark really
requires several dependencies to exist on all Python installations on the
cluster. I don't know if Spark itself can fix this. At least better errors
would be nice, but I'm also not sure how to fix that. Document it more
prominently?
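
Until the errors improve, a job could check up front which modules a given interpreter cannot find. The following is only a minimal sketch under stated assumptions: `missing_modules` and `host_report` are hypothetical helper names, not part of PySpark, and the code assumes Python 3 (`importlib.util.find_spec`).

```python
import importlib.util
import socket


def missing_modules(names):
    """Return the names from `names` that this interpreter cannot locate."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # A missing parent package (for dotted names) raises ImportError;
            # a module without a usable spec raises ValueError.
            found = False
        if not found:
            missing.append(name)
    return missing


def host_report(names):
    """Pair this machine's hostname with the modules it is missing."""
    return socket.gethostname(), tuple(missing_modules(names))


# Hypothetical usage on a cluster (not run here; `sc` is an existing SparkContext):
#   reports = (sc.parallelize(range(100), 100)
#                .mapPartitions(lambda _: [host_report(["pandas"])])
#                .distinct()
#                .collect())
```

Such a job is not guaranteed to land a task on every node, so an empty result would be a hint rather than proof that all workers have the dependency.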
> Spark fails with pandas import error when pandas is not explicitly imported
> by user
> -----------------------------------------------------------------------------------
>
> Key: SPARK-13303
> URL: https://issues.apache.org/jira/browse/SPARK-13303
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 1.6.0
> Environment: The Python installation used by the driver (edge node)
> has pandas installed on it, while the Python runtimes used on the data nodes
> do not have pandas installed. Pandas is never explicitly imported by
> pi.py.
> Reporter: Juliet Hougland
>
> Running `spark-submit pi.py` results in:
>   File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
>     command = pickleSer._read_with_length(infile)
>   File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
>     return self.loads(obj)
>   File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
>     return pickle.loads(obj)
> ImportError: No module named pandas.algos
>     at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
>     at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
>     at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>     at org.apache.spark.scheduler.Task.run(Task.scala:88)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> This is unexpected, and it is hard for users to work out why they see this error,
> as they themselves have not explicitly done anything with pandas.
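
The traceback above fails inside `pickle.loads` on the worker. One plausible reading is that the serialized command references a module by name, so unpickling has to import that module on the worker even though the user's script never imports it. The sketch below illustrates that mechanism with the standard `pickle` module only; PySpark's own serializers may behave differently in detail, and `driver_only_lib` is a made-up module name standing in for a library installed only on the driver.

```python
import pickle
import sys
import types

# Stand-in for a library that exists only on the driver (hypothetical name).
MODULE_NAME = "driver_only_lib"


def helper(x):
    return x + 1


# Make `helper` look as if it were defined in that module.
module = types.ModuleType(MODULE_NAME)
helper.__module__ = MODULE_NAME
helper.__qualname__ = "helper"
module.helper = helper
sys.modules[MODULE_NAME] = module

# "Driver" side: the function is pickled by reference (module name + attribute name).
payload = pickle.dumps(helper)

# "Worker" side: the module is unavailable, so unpickling tries to import it and fails.
del sys.modules[MODULE_NAME]
try:
    pickle.loads(payload)
    error = None
except ImportError as exc:
    error = exc

print(type(error).__name__, error)
```

The payload carries only the module and attribute names, which is why the failure surfaces as an import error on the worker rather than anywhere in the user's own code.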