Davies Liu created SPARK-6216:
---------------------------------

             Summary: Check Python version in worker before running PySpark job
                 Key: SPARK-6216
                 URL: https://issues.apache.org/jira/browse/SPARK-6216
             Project: Spark
          Issue Type: Improvement
            Reporter: Davies Liu


PySpark requires the same major Python version in the driver and the workers (e.g., both 2.6 or both 2.7). A mismatch, such as 2.7 in the driver and 2.6 in the workers (or vice versa), causes random, hard-to-diagnose errors.

For example:
{code}
davies@localhost:~/work/spark$ PYSPARK_PYTHON=python2.6 PYSPARK_DRIVER_PYTHON=python2.7 bin/pyspark
Using Python version 2.7.7 (default, Jun  2 2014 12:48:16)
SparkContext available as sc, SQLContext available as sqlCtx.
>>> sc.textFile('LICENSE').map(lambda l: l.split()).count()
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/davies/work/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/Users/davies/work/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2251, in 
pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2251, in 
pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2251, in 
pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 281, in func
    return f(iterator)
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 931, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/davies/work/spark/python/pyspark/rdd.py", line 931, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "<stdin>", line 1, in <lambda>
TypeError: 'bool' object is not callable

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:136)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:177)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:95)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
{code}
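
The proposed improvement is to fail fast with a clear message instead. A minimal sketch of such a check, assuming the driver sends its major.minor version string to the worker during the startup handshake (the helper name and the handshake framing are illustrative, not the actual patch):

{code}
import sys

def check_python_version(driver_version):
    """Raise a descriptive error if this worker's Python major.minor
    version differs from the driver's.

    driver_version: a string like "2.7", assumed to have been sent by
    the driver over the worker's control socket (framing not shown).
    """
    worker_version = "%d.%d" % sys.version_info[:2]
    if worker_version != driver_version:
        raise Exception(
            "Python in worker has different version %s than that in "
            "driver %s; PySpark cannot run with different Python versions"
            % (worker_version, driver_version))
{code}

Called at the top of the worker's main loop, a check like this would turn the confusing TypeError above into an explicit version-mismatch error before any task runs.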


