HyukjinKwon opened a new pull request #28968:
URL: https://github.com/apache/spark/pull/28968


   ### What changes were proposed in this pull request?
   
   This PR proposes:
   
   1. To introduce `InheritableThread` class, that works identically with 
`threading.Thread` but it can inherit the inheritable attributes of a JVM 
thread such as `InheritableThreadLocal`.
   
       This was a problem from the pinned thread mode, see also 
https://github.com/apache/spark/pull/24898. Now it works as below:
   
       ```python
       import pyspark
   
       spark.sparkContext.setLocalProperty("a", "hi")
       def print_prop():
           print(spark.sparkContext.getLocalProperty("a"))
   
       pyspark.InheritableThread(target=print_prop).start()
       ```
   
       ```
       hi
       ```
   
   2. Also, it adds the resource leak fix into `InheritableThread`. Py4J leaks 
the thread and does not close the connection from Python to JVM. In 
`InheritableThread`, it manually closes the connections when PVM garbage 
collection happens. So, JVM threads finish safely. I manually verified by 
profiling but there's also another easy way to verify:
   
       ```bash
       PYSPARK_PIN_THREAD=true ./bin/pyspark
       ```
   
       ```python
       >>> from threading import Thread
       >>> Thread(target=lambda: spark.range(1000).collect()).start()
       >>> Thread(target=lambda: spark.range(1000).collect()).start()
       >>> Thread(target=lambda: spark.range(1000).collect()).start()
       >>> spark._jvm._gateway_client.deque
       deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, 
<py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, 
<py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, 
<py4j.clientserver.ClientServerConnection object at 0x11a015358>, 
<py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
       >>> Thread(target=lambda: spark.range(1000).collect()).start()
       >>> spark._jvm._gateway_client.deque
       deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, 
<py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, 
<py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, 
<py4j.clientserver.ClientServerConnection object at 0x11a015358>, 
<py4j.clientserver.ClientServerConnection object at 0x119fc08d0>, 
<py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
       ```
   
       This issue is fixed now.
   
   3. Because now we have a fix for the issue here, it also proposes to 
deprecate `collectWithJobGroup` which was a temporary workaround added to avoid 
this leak issue.
   
   ### Why are the changes needed?
   
   To support pinned thread mode properly without a resource leak, and a proper 
inheritable local properties.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, it adds an API `InheritableThread` class for pinned thread mode.
   
   ### How was this patch tested?
   
   Manually tested as described above, and unit test was added as well.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to