Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/1680#issuecomment-50712114
  
    If you want to test this out, here's a neat experiment you can run in an 
interactive `pyspark` shell (try `local[k]` for different _k_):
    
    ```python
    from time import sleep
    def wait(x):  # x is the iterator over one partition's elements
        sleep(10)  # hold the worker busy for 10 seconds
        return x
    
    sc.parallelize(range(10000)).mapPartitions(wait).count()
    ```
    
    Prior to this patch, when I ran this with `local[3]`, I saw a peak of 13 
Python processes running: one for the driver, one for the daemon, eight for the 
daemon's forked processes (since my laptop has eight cores), and three for the 
actual workers.  After the tasks finished, I was still left with 10 processes, 
since the daemon's pool of forked processes remains alive.  These processes 
aren't terribly expensive, since they just sit idle waiting to `accept()` on 
the shared listening socket, but they're unnecessary.
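    
    If you want to watch the process count yourself, here is a minimal sketch 
(not part of this patch; it assumes a Unix-like system with `pgrep` on the 
path) that polls for Python processes from a separate shell while the job 
above runs:
    
    ```python
    import subprocess
    import time
    
    def count_python_processes():
        # `pgrep -c -f python` prints the number of processes whose command
        # line mentions "python"; it exits nonzero when there are no matches.
        try:
            return int(subprocess.check_output(["pgrep", "-c", "-f", "python"]))
        except subprocess.CalledProcessError:
            return 0
    
    # Poll every couple of seconds while the job runs in the other shell.
    for _ in range(10):
        print(count_python_processes())
        time.sleep(2)
    ```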
    
    After applying this patch, I see a peak of 5 Python processes running: one 
for the Python driver, one for `daemon.py`, and three for the workers.  Only 
the driver and the daemon remain running after the tasks finish.
    
    This patch may have broken task cancellation; I'd appreciate help in 
designing a regression test that checks for timely shutdown of Python workers 
when jobs are cancelled.  I'm considering an approach that uses something like 
this `sleep()` trick to avoid races between workers completing and the job 
being cancelled.
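    
    As a rough illustration of what such a test might look like (just a 
sketch, not the actual test; it assumes an interactive shell with `sc` defined 
and uses `sc.cancelAllJobs()`), one could submit a slow job from a background 
thread and cancel it mid-run:
    
    ```python
    import threading
    import time
    
    def slow_partition(iterator):
        time.sleep(60)  # keep the Python workers busy long enough to cancel mid-run
        return iterator
    
    def run_job():
        try:
            sc.parallelize(range(1000), 4).mapPartitions(slow_partition).count()
        except Exception:
            pass  # cancellation surfaces as an exception in the submitting thread
    
    t = threading.Thread(target=run_job)
    t.start()
    time.sleep(5)        # give the tasks time to start sleeping
    sc.cancelAllJobs()   # cancel while the workers are mid-sleep
    t.join()
    # A real regression test would now assert that the Python worker processes
    # exit promptly, e.g. by polling the OS for leftover worker processes.
    ```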

