GitHub user vundela opened a pull request:

    https://github.com/apache/spark/pull/17722

    [SPARK-12717][PYSPARK][BRANCH-1.6] Resolving race condition with pyspark broadcasts when using multiple threads

    
    
    ## What changes were proposed in this pull request?
    
    In PySpark, when multiple threads are used, broadcast variables can be picked up by the wrong PythonRDD wrap function because of a race condition between the threads on the Java side with Py4J, which leads to the following exception:
    
    16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
    org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
        command = pickleSer._read_with_length(infile)
      File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
        return self.loads(obj)
      File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
        return pickle.loads(obj)
      File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py", line 39, in _from_id
        raise Exception("Broadcast variable '%s' not loaded!" % bid)
    Exception: (Exception("Broadcast variable '6' not loaded!",), <function _from_id at 0xce7a28>, (6L,))
    
        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
    
    This change fixes the race condition in branch-1.6 by making sure that broadcast variables are always picked up with the same PythonRDD wrap function.
    
    ## How was this patch tested?
    Reproduced the issue described in SPARK-12717 by following the instructions specified in the JIRA ticket, and verified that the issue is fixed with these changes.
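    
    A minimal sketch of the driver-side workload that triggers this kind of race is shown below (illustrative only, not the exact reproduction from the JIRA ticket; the thread count, data, and job body are assumptions). Each driver thread creates its own broadcast variable and runs a job whose closure reads it; under the race, a task can be handed the wrong broadcast id and fail with the "Broadcast variable '<id>' not loaded!" error shown above.
    
        import threading
        from pyspark import SparkContext
        
        sc = SparkContext(appName="broadcast-race-sketch")
        
        def run_job(n):
            # One broadcast per driver thread; the task closure below pickles a
            # reference to it, which is where the wrong broadcast id can be
            # picked up when several threads submit jobs concurrently.
            bv = sc.broadcast(list(range(n)))
            result = sc.parallelize(range(10)) \
                       .map(lambda x: x + len(bv.value)) \
                       .collect()
            print(n, result)
        
        # Thread count and data sizes are arbitrary for illustration.
        threads = [threading.Thread(target=run_job, args=(i,)) for i in range(1, 9)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        
        sc.stop()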
    
    Please review http://spark.apache.org/contributing.html before opening a pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vundela/spark SPARK-12717_spark1.6

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17722.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17722
    
----
commit af8e6d762a5ce481ab5cb0a385660e2766bee195
Author: Srinivasa Reddy Vundela <[email protected]>
Date:   2017-04-21T18:31:51Z

    [SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts when using multiple threads

----

