GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/20151

    [SPARK-22959][PYTHON] Configuration to select the modules for daemon and 
worker in PySpark

    ## What changes were proposed in this pull request?
    
    PySpark currently hard-codes the use of `pyspark/daemon.py` and
    `pyspark/worker.py`.
    
    This does not allow custom modifications to them; for example, it is
    sometimes hard to debug what happens inside Python worker processes.
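    
    To make the debugging use case concrete, here is a minimal sketch of a
    custom daemon module that logs around each worker invocation before
    delegating to the stock worker, following the same pattern as the
    `coverage_daemon` example below. The module name `debug_daemon` and the
    log path are hypothetical and not part of this PR:
    
    ```python
    # debug_daemon.py -- hypothetical custom daemon module for debugging.
    import logging
    import os
    
    from pyspark import daemon
    from pyspark.worker import main
    
    # The log path here is just an example.
    logging.basicConfig(filename="/tmp/pyspark_worker.log",
                        level=logging.DEBUG)
    
    
    def _logged_main(*args, **kwargs):
        # Log the forked worker's pid before and after the stock worker loop.
        logging.debug("worker starting in pid %d", os.getpid())
        try:
            main(*args, **kwargs)
        finally:
            logging.debug("worker exiting in pid %d", os.getpid())
    
    daemon.worker_main = _logged_main
    
    
    if __name__ == '__main__':
        daemon.manager()
    ```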
    
    This is actually related to SPARK-7721 too, as coverage is unable to
    track code executed in worker processes created via `os.fork`. With some
    custom fixes that force coverage collection in the forked workers, it
    works fine.
    
    This is also related to SPARK-20368, which describes Sentry support and
    (roughly) needs some changes on the worker side.
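    
    As a rough, hypothetical sketch of what the Sentry case could look like
    under this mechanism (SPARK-20368 may well end up with a different
    design), an error-reporting daemon module might wrap the worker entry
    point in the same way; the `sentry_sdk` usage and the `SENTRY_DSN`
    environment variable are assumptions, not part of this PR:
    
    ```python
    # sentry_daemon.py -- hypothetical sketch; assumes the sentry_sdk
    # package and a SENTRY_DSN environment variable on the workers.
    import os
    
    import sentry_sdk
    
    from pyspark import daemon
    from pyspark.worker import main
    
    sentry_sdk.init(dsn=os.environ["SENTRY_DSN"])
    
    
    def _sentry_wrapped(*args, **kwargs):
        try:
            main(*args, **kwargs)
        except BaseException as e:
            # Report the worker failure, then re-raise so the normal
            # PySpark error handling still runs.
            sentry_sdk.capture_exception(e)
            raise
    
    daemon.worker_main = _sentry_wrapped
    
    
    if __name__ == '__main__':
        daemon.manager()
    ```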
    
    With this configuration, advanced users will be able to plug in many
    kinds of workarounds, and we can meet such potential needs in the future.
    
    As an example, suppose I configure the module `coverage_daemon` and have
    the following `coverage_daemon.py` on the Python path:
    
    ```python
    import os
    
    from pyspark import daemon
    
    
    if "COVERAGE_PROCESS_START" in os.environ:
        from pyspark.worker import main
    
        def _cov_wrapped(*args, **kwargs):
            # Start coverage collection inside this forked worker process,
            # run the stock worker loop, and always flush the results so
            # that data from the `os.fork`-ed child is not lost.
            import coverage
            cov = coverage.coverage(
                config_file=os.environ["COVERAGE_PROCESS_START"])
            cov.start()
            try:
                main(*args, **kwargs)
            finally:
                cov.stop()
                cov.save()
    
        # Swap in the wrapped entry point that the daemon forks into.
        daemon.worker_main = _cov_wrapped
    
    
    if __name__ == '__main__':
        # Everything else is delegated to the stock PySpark daemon.
        daemon.manager()
    ```
    
    More importantly, this leaves the main code intact while still allowing
    such workarounds.
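    
    For reference, pointing Spark at such a module would then be a matter of
    setting the new configuration when building the session. The key name
    used below, `spark.python.daemon.module`, is an assumption; check the
    merged change for the actual name:
    
    ```python
    from pyspark.sql import SparkSession
    
    spark = (
        SparkSession.builder
        .appName("coverage-example")
        # Launch coverage_daemon.py instead of pyspark/daemon.py; the module
        # must be importable on each executor's PYTHONPATH.
        .config("spark.python.daemon.module", "coverage_daemon")
        .getOrCreate()
    )
    ```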
    
    ## How was this patch tested?
    
    Manually tested.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark configuration-daemon-worker

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20151.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20151
    
----
commit f74df4b566594152fa1efe1e3fb6033cbcf3993b
Author: hyukjinkwon <gurwls223@...>
Date:   2018-01-04T12:39:56Z

    Configuration to select the modules for daemon and worker in PySpark

----

