[ https://issues.apache.org/jira/browse/SPARK-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144974#comment-15144974 ]

Sean Owen commented on SPARK-13303:
-----------------------------------

Agree, I think this is one of those big "known issues": PySpark really 
requires several dependencies to exist in every Python installation on the 
cluster. I don't know if Spark itself can fix this. At least better errors 
would be nice, but I'm also not sure how to fix that. Document it more 
prominently?
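
For illustration, here is a minimal, untested sketch of the kind of
fail-fast check that would give a clearer error. `check_worker_modules` is a
hypothetical helper (not an existing Spark API), and it assumes a live
SparkContext:

    # Untested sketch; check_worker_modules is hypothetical, not a Spark API.
    def check_worker_modules(sc, modules=("pandas",)):
        """Fail fast with a readable error if any executor lacks a module."""
        def probe(_):
            missing = []
            for name in modules:
                try:
                    __import__(name)
                except ImportError:
                    missing.append(name)
            return missing
        # Run one lightweight probe task per core and collect the results.
        found = sc.parallelize(range(sc.defaultParallelism)).map(probe).collect()
        missing = sorted(set(m for per_task in found for m in per_task))
        if missing:
            raise ImportError("Missing from worker Python installations: %s"
                              % ", ".join(missing))

Run right after the SparkContext is created, something like this would turn
the opaque pickle-time ImportError below into an explicit message.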

> Spark fails with pandas import error when pandas is not explicitly imported by user
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-13303
>                 URL: https://issues.apache.org/jira/browse/SPARK-13303
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0
>         Environment: The Python installation used by the driver (edge node) 
> has pandas installed, while the Python runtimes used on the data nodes do 
> not have pandas installed. Pandas is never explicitly imported by pi.py.
>            Reporter: Juliet Hougland
>
> Running `spark-submit pi.py` results in:
>   File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
>     command = pickleSer._read_with_length(infile)
>   File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
>     return self.loads(obj)
>   File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
>     return pickle.loads(obj)
> ImportError: No module named pandas.algos
>       at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
>       at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
>       at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>       at org.apache.spark.scheduler.Task.run(Task.scala:88)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> This is unexpected, and it is hard for users to unravel why they see this 
> error, since they have not explicitly done anything with pandas themselves.
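
The usual workaround for this class of problem is to point the executors at
a Python installation that actually has the dependencies. A rough sketch
(the interpreter path below is a placeholder for whatever exists on your
nodes):

    # Sketch: steer executors toward a Python that has pandas installed.
    # /opt/anaconda/bin/python is a placeholder path.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("pi")
            # Workers consult PYSPARK_PYTHON when choosing their interpreter.
            .setExecutorEnv("PYSPARK_PYTHON", "/opt/anaconda/bin/python"))
    sc = SparkContext(conf=conf)

Equivalently, exporting PYSPARK_PYTHON before running spark-submit achieves
the same thing.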


