Github user jsoltren commented on the issue:
https://github.com/apache/spark/pull/17694
This seems reasonable to me.
There are some typos in the PR description. I think you meant "pickled"
instead of "picked" in a few places.
Using threading.Lock seems okay here from my admittedly limited
understanding of the deep details of Python, and my reading of
https://docs.python.org/2/library/threading.html#lock-objects.
@vundela and I chatted off thread about this some. The precise race is
this: the call to _wrap_function will define a number of broadcast variables.
In the time between when the _wrap_function call finishes and
self.ctx._jvm.PythonRDD executes, the RDD itself can be modified, perhaps
changing broadcast variables and introducing the "Broadcast variable '%s' not
loaded!" exception.
My understanding is that, due to the Global Interpreter Lock, this lock
will cause all other execution to cease while this block of code runs,
implicitly preventing any races. This is a very coarse-grained lock for this
action but it is as good as we can get. (Someone please correct me if I'm
wrong here.)
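
To make the pattern concrete, here is a minimal sketch of holding a threading.Lock across the two steps. The names (wrap_function, build_rdd, guarded_build, clear_all, broadcast_vars) are hypothetical stand-ins and are not PySpark's actual code. It also assumes that any thread modifying the shared state acquires the same lock, since a threading.Lock only excludes other threads that try to acquire it.

```python
import threading

# Hypothetical stand-ins for illustration; none of these names come from PySpark.
_lock = threading.Lock()
broadcast_vars = {}  # shared state that another thread may modify


def wrap_function(name, value):
    """Step 1: register a broadcast variable (stand-in for _wrap_function)."""
    broadcast_vars[name] = value
    return name


def build_rdd(name):
    """Step 2: consume the variable (stand-in for constructing the PythonRDD)."""
    if name not in broadcast_vars:
        raise KeyError("Broadcast variable '%s' not loaded!" % name)
    return broadcast_vars[name]


def guarded_build(name, value):
    # Hold the lock across both steps so no cooperating writer runs in between.
    with _lock:
        handle = wrap_function(name, value)
        return build_rdd(handle)


def clear_all():
    # Assumption: writers take the same lock. Without this, the lock in
    # guarded_build would not stop this function from running mid-sequence.
    with _lock:
        broadcast_vars.clear()
```

With both sides taking the lock, a call such as guarded_build("b0", 1) cannot observe the dictionary being cleared between its two steps.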
It would be good if the PR description captured some of the above
discussion.