GitHub user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/2624#issuecomment-57853572
It's worth noting that the ThreadLocals don't seem to have caused problems in
any of the existing uses of Spark / PySpark. In PySpark Streaming, I think
we're running into a scenario something like this:
- Java invokes a Python callback through the Py4J callback server.
Internally, the callback server uses some thread pool.
- The Python callback calls back into Java through Py4J.
- Somewhere along the line, `SparkEnv.set()` is called, leaking the current
SparkEnv into one of the Py4J GatewayServer or CallbackServer pool threads.
- This thread is re-used when a new Python SparkContext is created using
the same GatewayServer.
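The thread-reuse scenario above can be demonstrated with a small Python sketch. The names `set_env` / `get_env` are hypothetical stand-ins for `SparkEnv.set()` / `SparkEnv.get()`, and a single-worker pool plays the role of the Py4J callback server's thread pool:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for SparkEnv's ThreadLocal storage.
_env = threading.local()

def set_env(value):
    # Analogous to SparkEnv.set(): stores the env in a ThreadLocal.
    _env.value = value

def get_env():
    # Analogous to SparkEnv.get(): returns whatever this thread last stored.
    return getattr(_env, "value", None)

# A single-thread pool models the callback server reusing one pool thread.
pool = ThreadPoolExecutor(max_workers=1)

# A callback from the first context sets the env on the pool thread.
pool.submit(set_env, "env-for-stopped-context").result()

# A later context reuses the same pool thread and sees the stale env.
stale = pool.submit(get_env).result()
print(stale)  # the old env leaks: "env-for-stopped-context"
```

The point is that nothing ever clears the ThreadLocal when the first context goes away, so the leak is a property of pool-thread reuse rather than of any one call site.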
I thought of another fix that would let the ThreadLocals keep working: add a
mutable field to SparkEnv instances that records whether the environment's
associated SparkContext has been stopped. In SparkEnv.get(), we can check
this field to decide whether to return the ThreadLocal value or fall back to
lastSparkEnv. This approach is more confusing / complex than removing the
ThreadLocals, though.
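A rough Python sketch of that idea, with hypothetical names (`Env`, `set_env`, and `get_env` stand in for SparkEnv and its companion object; only `lastSparkEnv` is a name taken from the discussion above):

```python
import threading

class Env:
    """Hypothetical stand-in for a SparkEnv instance."""
    def __init__(self, name):
        self.name = name
        self.stopped = False  # set to True when the owning SparkContext stops

_thread_local = threading.local()
_last_env = None  # global fallback, playing the role of lastSparkEnv

def set_env(env):
    # Analogous to SparkEnv.set(): records the env both in the
    # ThreadLocal and as the most recently set env overall.
    global _last_env
    _thread_local.env = env
    _last_env = env

def get_env():
    # The proposed fix: ignore a ThreadLocal env whose SparkContext
    # has been stopped, and fall back to the last-set env instead.
    env = getattr(_thread_local, "env", None)
    if env is not None and not env.stopped:
        return env
    return _last_env

# A pool thread picked up env1, then env1's context was stopped.
env1 = Env("env1")
set_env(env1)
env1.stopped = True

# A new context sets env2 from a different thread; this thread's
# stale ThreadLocal still points at env1.
env2 = Env("env2")
t = threading.Thread(target=set_env, args=(env2,))
t.start()
t.join()

print(get_env().name)  # falls back to env2, not the stale env1
```

The extra stopped-check is exactly the complexity the comment is arguing against: the leak is masked rather than removed, which is why simply dropping the ThreadLocals looks cleaner.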
I'm still strongly in favor of doing the work to confirm that SparkEnv is
currently used as though it's a global object and then removing the
ThreadLocals.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]