[ https://issues.apache.org/jira/browse/SPARK-4851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-4851:
------------------------------
             Priority: Minor  (was: Major)
          Description: 
*Reproduction:*

{code}
class A:
    @staticmethod
    def foo(self, x):
        return x

sc.parallelize([1]).map(lambda x: A.foo(x)).count()
{code}

This gives

{code}
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 
3, localhost): org.apache.spark.api.python.PythonException: Traceback (most 
recent call last):
  File "/Users/joshrosen/Documents/Spark/python/pyspark/worker.py", line 107, 
in main
    process()
  File "/Users/joshrosen/Documents/Spark/python/pyspark/worker.py", line 98, in 
process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/joshrosen/Documents/Spark/python/pyspark/rdd.py", line 2070, in 
pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/joshrosen/Documents/Spark/python/pyspark/rdd.py", line 2070, in 
pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/joshrosen/Documents/Spark/python/pyspark/rdd.py", line 2070, in 
pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/joshrosen/Documents/Spark/python/pyspark/rdd.py", line 247, in 
func
    return f(iterator)
  File "/Users/joshrosen/Documents/Spark/python/pyspark/rdd.py", line 818, in 
<lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/joshrosen/Documents/Spark/python/pyspark/rdd.py", line 818, in 
<genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "<stdin>", line 1, in <lambda>
RuntimeError: uninitialized staticmethod object

        at 
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:136)
        at 
org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:173)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:95)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:264)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:231)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:56)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745){code}

  was:
{code}
class A:
    @staticmethod
    def calc_something(self, x):
        return x*x

rdd = sc.parallelize(xrange(1, 1000), 60)
rdd.count()
rdd.map(lambda x: A.calc_something(x))
{code}

error:
{code}
RuntimeError: uninitialized staticmethod object
{code}

          Environment:     (was: mac Os )
    Affects Version/s:     (was: 1.0.0)
                       1.3.0
                       1.0.2
                       1.2.0
               Labels:   (was: newbie)
              Summary: "Uninitialized staticmethod object" error in PySpark  
(was: can't pass a static method to rdd object in python )

I've edited this issue to include an updated reproduction and a more complete 
stack trace.

I'm pretty sure this is caused by limitations in our pickling library. Dill, 
another Python pickling extension, has trouble with this case as well, and it 
doesn't appear that they have a fix either: 
https://github.com/uqfoundation/dill/issues/13
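For context, a minimal sketch in plain Python (no Spark required; this is illustrative code, not code from the issue) of why the {{staticmethod}} wrapper is awkward for serializers: it is a descriptor, not the function itself, and a pickler that rebuilds the class dict field-by-field can end up holding the bare wrapper. Extracting the plain function, e.g. by defining it at module level or reading {{\_\_func\_\_}}, avoids pickling the wrapper at all:

```python
import pickle

def foo(x):
    # Defined at module level, so pickle can resolve it by reference.
    return x

class A:
    # Equivalent to decorating foo with @staticmethod inside the class.
    foo = staticmethod(foo)

# Attribute access through the class unwraps the descriptor back to the
# plain function, which pickles by reference without trouble:
assert A.foo is foo
assert pickle.loads(pickle.dumps(A.foo))(7) == 7

# The raw descriptor stored in the class __dict__ is the object that
# serializers choke on; its function is still reachable via __func__:
raw = A.__dict__["foo"]
assert isinstance(raw, staticmethod)
assert raw.__func__ is foo
```

So a possible workaround for users hitting this error is to move the function to module level (or reference it as a plain function) rather than shipping the class with its staticmethod descriptor to the workers.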

> "Uninitialized staticmethod object" error in PySpark
> ----------------------------------------------------
>
>                 Key: SPARK-4851
>                 URL: https://issues.apache.org/jira/browse/SPARK-4851
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.0.2, 1.2.0, 1.3.0
>            Reporter: Nadav Grossug
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
