[ https://issues.apache.org/jira/browse/SPARK-13303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15557230#comment-15557230 ]
holdenk commented on SPARK-13303:
---------------------------------

What if we added a requirements file? We have one for our dev tools, so having one for PySpark itself should be pretty reasonable.

> Spark fails with pandas import error when pandas is not explicitly imported by user
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-13303
>                 URL: https://issues.apache.org/jira/browse/SPARK-13303
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.6.0
>         Environment: The Python installation used by the driver (edge node) has pandas installed, while the Python runtimes used on the data nodes do not. Pandas is never explicitly imported by pi.py.
>            Reporter: Juliet Hougland
>
> Running `spark-submit pi.py` results in:
>   File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
>     command = pickleSer._read_with_length(infile)
>   File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
>     return self.loads(obj)
>   File "/opt/cloudera/parcels/CDH-5.5.0-1.cdh5.5.0.p0.8/lib/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
>     return pickle.loads(obj)
> ImportError: No module named pandas.algos
>       at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:138)
>       at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:179)
>       at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:97)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>       at org.apache.spark.scheduler.Task.run(Task.scala:88)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
>
> This is unexpected, and it is hard for users to work out why they see this error, since they have not explicitly done anything with pandas.
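
One plausible reading of the traceback above is that the payload pickled on the driver refers to a module the workers cannot import, so `pickle.loads` fails on the worker side. A minimal sketch of that mechanism in plain Python, with no Spark involved: `driver_only_lib` and `helper` are hypothetical names standing in for pandas, and dropping the module from `sys.modules` only simulates a worker where it is not installed.

```python
import pickle
import sys
import types


def _helper(x):
    return x + 1


def make_driver_payload():
    """Pickle a function that lives in a module only the 'driver' has."""
    # Hypothetical module name; it stands in for pandas on the edge node.
    module = types.ModuleType("driver_only_lib")
    _helper.__module__ = "driver_only_lib"
    _helper.__qualname__ = _helper.__name__ = "helper"
    module.helper = _helper
    sys.modules["driver_only_lib"] = module
    # pickle stores functions by reference (module name + attribute name),
    # so the payload carries no code, only "driver_only_lib.helper".
    return pickle.dumps(_helper)


def load_on_worker(payload):
    """Unpickle as a 'worker' on which the module is not installed."""
    # Simulate the missing install by forgetting the in-memory module.
    sys.modules.pop("driver_only_lib", None)
    try:
        return pickle.loads(payload)
    except ImportError as exc:
        # The failure surfaces at load time, not where the user wrote code.
        return "ImportError: {}".format(exc)


if __name__ == "__main__":
    print(load_on_worker(make_driver_payload()))
```

The error appears during deserialization rather than at any `import` statement in user code, which would explain why a script that never imports pandas can still fail this way.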