GitHub user vundela opened a pull request:
https://github.com/apache/spark/pull/17722
[SPARK-12717][PYSPARK][BRANCH-1.6] Resolving race condition with pyspark broadcasts when using multiple threads
## What changes were proposed in this pull request?
In PySpark, when multiple threads are used, broadcast variables can be picked up by the wrong PythonRDD wrap function because of a race condition between the threads on the Java side (via Py4J). This leads to the following exception:
```
16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
    command = pickleSer._read_with_length(infile)
  File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
    return self.loads(obj)
  File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
    return pickle.loads(obj)
  File "/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/broadcast.py", line 39, in _from_id
    raise Exception("Broadcast variable '%s' not loaded!" % bid)
Exception: (Exception("Broadcast variable '6' not loaded!",), <function _from_id at 0xce7a28>, (6L,))
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
```
This change fixes the race condition in branch-1.6 by making sure that broadcast variables are always picked up with the same PythonRDD wrap function.
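The failure mode can be illustrated outside Spark with a minimal pure-Python sketch. All names below (`Driver`, `pending_broadcasts`, `submit_unsafe`, `submit_safe`) are hypothetical and are not Spark or Py4J APIs; the sketch only shows the general pattern of two threads racing on one shared slot, analogous to a task being launched with another thread's broadcast state:

```python
# Hypothetical sketch of the race: two submitting threads share one
# unsynchronized slot holding "the broadcasts for the job being launched".
# Without a lock, a thread can launch with the OTHER thread's value --
# the analogue of a task deserializing the wrong broadcast id.
import threading

class Driver:
    def __init__(self):
        self.pending_broadcasts = None   # shared, unsynchronized slot
        self.lock = threading.Lock()

    def submit_unsafe(self, my_broadcasts, barrier):
        # Step 1: register this thread's broadcasts for the upcoming job.
        self.pending_broadcasts = my_broadcasts
        barrier.wait()                   # widen the race window deterministically
        # Step 2: launch with whatever the slot now holds (maybe not ours).
        return self.pending_broadcasts

    def submit_safe(self, my_broadcasts, barrier):
        barrier.wait()
        with self.lock:                  # register + launch atomically
            self.pending_broadcasts = my_broadcasts
            return self.pending_broadcasts

def run(submit):
    driver = Driver()
    barrier = threading.Barrier(2)
    results = {}

    def worker(name, broadcasts):
        results[name] = submit(driver, broadcasts, barrier)

    threads = [threading.Thread(target=worker, args=("a", "broadcast-a")),
               threading.Thread(target=worker, args=("b", "broadcast-b"))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# With submit_unsafe, both writes land before the barrier, so both threads
# read the same final value and one of them launches with the wrong one.
unsafe = run(Driver.submit_unsafe)
# With submit_safe, register-and-launch is atomic, so each thread gets its own.
safe = run(Driver.submit_safe)
```

The fix in this PR plays the role of `submit_safe`: it makes the "pick broadcasts" and "launch job" steps consistent per thread instead of letting them interleave.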
## How was this patch tested?
- Reproduced the issue described in SPARK-12717, following the instructions in the JIRA ticket.
- Verified that the issue no longer occurs with these changes.
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vundela/spark SPARK-12717_spark1.6
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17722.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17722
----
commit af8e6d762a5ce481ab5cb0a385660e2766bee195
Author: Srinivasa Reddy Vundela <[email protected]>
Date: 2017-04-21T18:31:51Z
[SPARK-12717][PYSPARK] Resolving race condition with pyspark broadcasts
when using multiple threads
----