Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/1680#issuecomment-50712114
If you want to test this out, here's a neat experiment you can run in an
interactive `pyspark` shell (try `local[k]` for different _k_):
```python
from time import sleep

def wait(x):
    # mapPartitions passes each partition's iterator as x; sleep, then
    # return it unchanged so count() still sees every element.
    sleep(10)  # 10 seconds
    return x

sc.parallelize(range(10000)).mapPartitions(wait).count()
```
Prior to this patch, when I ran this with `local[3]`, I saw a peak of 13
Python processes running: one for the driver, one for the daemon, eight for the
daemon's forked processes (since my laptop has eight cores), and three for the
actual workers. After the tasks finished, 10 processes were still left, since the
daemon's pool of forked processes stays alive. These processes aren't
terribly expensive, since they just sit idle waiting to `accept()` on the
shared listening socket, but they're unnecessary.
After applying this patch, I saw a peak of 5 Python processes running: one
for the Python driver, one for `daemon.py`, and three for the workers. Only
the driver and the daemon remained running after the tasks finished.
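In case it helps with reproducing these counts, something along these lines, run
from a second terminal while the job above is executing, will show the process
count over time (just a rough sketch, not part of this patch; it simply greps
`ps` output for `python`):
```python
# Crude monitor: count Python processes once a second while the job runs.
# This matches any command line containing "python" (including this script
# itself), so run it on an otherwise quiet machine or tighten the match.
import subprocess
import time

for _ in range(60):
    ps_output = subprocess.check_output(["ps", "-eo", "args"]).decode()
    count = sum("python" in line for line in ps_output.splitlines())
    print("%d Python processes" % count)
    time.sleep(1)
```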
This patch may have broken task cancellation; I'd appreciate help in
designing a regression test that checks for timely shutdown of Python
workers when jobs are cancelled. I'm considering an approach that does
something similar to this `sleep()` test in order to avoid races between
workers completing and the job being cancelled; see the sketch below.
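To make that concrete, here's the rough shape of the test I'm imagining (only a
sketch, not something in this patch): run the slow job from a background thread,
cancel it, and assert that the extra Python processes disappear within a
deadline. It reuses `wait()` from the snippet above, assumes `sc.cancelAllJobs()`
is callable from PySpark, and the `pyspark.daemon` match is an assumption about
how the daemon and its forked workers show up in `ps`:
```python
# Rough sketch of a cancellation regression test (not part of this patch).
import subprocess
import threading
import time

def count_pyspark_daemons():
    # Crude: count processes whose command line mentions pyspark.daemon
    # (forked workers inherit the daemon's command line).
    args = subprocess.check_output(["ps", "-eo", "args"]).decode()
    return sum("pyspark.daemon" in line for line in args.splitlines())

def slow_job():
    try:
        sc.parallelize(range(10000)).mapPartitions(wait).count()
    except Exception:
        pass  # cancellation is expected to surface as an error in this thread

job = threading.Thread(target=slow_job)
job.start()
time.sleep(5)        # let the slow tasks actually start, to avoid racing them
sc.cancelAllJobs()   # assumed to be exposed in PySpark
job.join(30)

# The property under test: forked workers shut down promptly after
# cancellation, leaving at most the daemon itself behind.
deadline = time.time() + 10
while count_pyspark_daemons() > 1 and time.time() < deadline:
    time.sleep(0.5)
assert count_pyspark_daemons() <= 1, "Python workers lingered after cancellation"
```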