GitHub user vundela opened a pull request:

    https://github.com/apache/spark/pull/17694

    [SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcas…

    …ts when using multiple threads
    
    ## What changes were proposed in this pull request?
    
    In PySpark, when multiple threads are used, broadcast variables are 
attached to the wrong PythonRDD wrap functions because of a race condition 
between the threads on the Java side (via py4j), which leads to the following 
exception:
    
    16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
    org.apache.spark.api.python.PythonException: Traceback (most recent call 
last):
      File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",
 line 98, in main
        command = pickleSer._read_with_length(infile)
      File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
 line 164, in _read_with_length
        return self.loads(obj)
      File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
 line 422, in loads
        return pickle.loads(obj)
      File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py",
 line 39, in _from_id
        raise Exception("Broadcast variable '%s' not loaded!" % bid)
    Exception: (Exception("Broadcast variable '6' not loaded!",), <function 
_from_id at 0xce7a28>, (6L,))
    
        at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    
    This change fixes the race condition by ensuring that broadcast 
variables are attached to the same PythonRDD function.
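    The Spark-side fix cannot be demonstrated without a running cluster, but 
the class of bug is easy to sketch in plain Python: a registry of pending 
broadcasts shared across threads lets one job's snapshot capture (or lose) 
another job's broadcasts, while thread-local state keeps each job's 
broadcasts isolated. The names below (`SharedRegistry`, `ThreadLocalRegistry`, 
`run_jobs`) are hypothetical illustrations, not Spark internals.

```python
# Hypothetical sketch of the race class, NOT Spark's actual implementation.
import threading

class SharedRegistry:
    """Buggy pattern: one broadcast list shared by all threads."""
    def __init__(self):
        self.broadcasts = []

    def register(self, bid):
        self.broadcasts.append(bid)

    def snapshot_and_clear(self):
        # Snapshots everything registered so far, including other jobs' ids.
        snap, self.broadcasts = list(self.broadcasts), []
        return snap

class ThreadLocalRegistry:
    """Fixed pattern: each thread sees only its own broadcast list."""
    def __init__(self):
        self.local = threading.local()

    def _lst(self):
        if not hasattr(self.local, "broadcasts"):
            self.local.broadcasts = []
        return self.local.broadcasts

    def register(self, bid):
        self._lst().append(bid)

    def snapshot_and_clear(self):
        snap = list(self._lst())
        self._lst().clear()
        return snap

def run_jobs(registry, n_threads=8):
    """Each thread registers one 'broadcast' (its own id), then builds its job.

    A barrier forces the worst-case interleaving: every registration happens
    before any snapshot. Returns True iff every job captured exactly its own
    broadcast."""
    results = {}
    barrier = threading.Barrier(n_threads)

    def job(tid):
        registry.register(tid)
        barrier.wait()
        results[tid] = registry.snapshot_and_clear()

    threads = [threading.Thread(target=job, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(results[t] == [t] for t in range(n_threads))

print(run_jobs(ThreadLocalRegistry()))  # True: each job sees only its own broadcast
print(run_jobs(SharedRegistry()))       # False: the first snapshot grabs every pending broadcast
```

    With the shared registry, whichever thread snapshots first walks away 
with all pending broadcast ids and the rest get none; keying the pending 
state per thread removes the interleaving entirely, which mirrors the intent 
of pinning broadcasts to the same PythonRDD function.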
    
    ## How was this patch tested?
    1) Reproduced the issue described in SPARK-12717, following the 
instructions in the JIRA ticket.
    2) Verified that the issue is fixed with the changes.
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vundela/spark SPARK-12717

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17694.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17694
    
----
commit 4670678ba4fa7f2a7f66bc9309716fcefb05c54d
Author: Srinivasa Reddy Vundela <[email protected]>
Date:   2017-04-20T01:37:12Z

    [SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts 
when using multiple threads

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
