HyukjinKwon opened a new pull request #28968:
URL: https://github.com/apache/spark/pull/28968
### What changes were proposed in this pull request?
This PR proposes:
1. To introduce `InheritableThread` class, that works identically with
`threading.Thread` but it can inherit the inheritable attributes of a JVM
thread such as `InheritableThreadLocal`.
This was a problem from the pinned thread mode, see also
https://github.com/apache/spark/pull/24898. Now it works as below:
```python
import pyspark
spark.sparkContext.setLocalProperty("a", "hi")
def print_prop():
print(spark.sparkContext.getLocalProperty("a"))
pyspark.InheritableThread(target=print_prop).start()
```
```
hi
```
2. Also, it adds the resource leak fix into `InheritableThread`. Py4J leaks
the thread and does not close the connection from Python to JVM. In
`InheritableThread`, it manually closes the connections when PVM garbage
collection happens. So, JVM threads finish safely. I manually verified by
profiling but there's also another easy way to verify:
```bash
PYSPARK_PIN_THREAD=true ./bin/pyspark
```
```python
>>> from threading import Thread
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>,
<py4j.clientserver.ClientServerConnection object at 0x119fc9b70>,
<py4j.clientserver.ClientServerConnection object at 0x119fc9e10>,
<py4j.clientserver.ClientServerConnection object at 0x11a015358>,
<py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>,
<py4j.clientserver.ClientServerConnection object at 0x119fc9b70>,
<py4j.clientserver.ClientServerConnection object at 0x119fc9e10>,
<py4j.clientserver.ClientServerConnection object at 0x11a015358>,
<py4j.clientserver.ClientServerConnection object at 0x119fc08d0>,
<py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
```
This issue is fixed now.
3. Because now we have a fix for the issue here, it also proposes to
deprecate `collectWithJobGroup` which was a temporary workaround added to avoid
this leak issue.
### Why are the changes needed?
To support pinned thread mode properly without a resource leak, and a proper
inheritable local properties.
### Does this PR introduce _any_ user-facing change?
Yes, it adds an API `InheritableThread` class for pinned thread mode.
### How was this patch tested?
Manually tested as described above, and unit test was added as well.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]