GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/20151
[SPARK-22959][PYTHON] Configuration to select the modules for daemon and
worker in PySpark
## What changes were proposed in this pull request?
PySpark currently hard-codes the use of `pyspark/daemon.py` and `pyspark/worker.py`.
This leaves no room for custom modifications; for example, it is sometimes
hard to debug what happens inside Python worker processes.
This is also related to SPARK-7721: coverage.py is somehow unable to track
coverage across `os.fork`, yet it works fine once we apply some custom fixes
that force coverage collection.
It is likewise related to SPARK-20368, which describes Sentry support and
(roughly) requires some changes on the worker side.
With this configuration, advanced users can plug in their own daemon and
worker modules as workarounds, and we can meet such potential needs in the future.
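For concreteness, here is a minimal sketch of how an application might select
a custom daemon module, assuming the configuration key is named
`spark.python.daemon.module` (the key name is an assumption based on this PR's
description; a matching `spark.python.worker.module` would select the worker
module):
```python
from pyspark.sql import SparkSession

# Hypothetical usage: point PySpark at a custom daemon module that is
# importable on the executors' Python path. The key name below is an
# assumption based on this PR's description, not a documented API.
spark = (SparkSession.builder
         .config("spark.python.daemon.module", "custom_daemon")
         .getOrCreate())
```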
As an example, suppose I set the daemon module to `coverage_daemon` and have
`coverage_daemon.py` on the Python path:
```python
import os

from pyspark import daemon

if "COVERAGE_PROCESS_START" in os.environ:
    from pyspark.worker import main

    def _cov_wrapped(*args, **kwargs):
        # Start coverage collection inside the forked worker process,
        # then run the stock worker entry point under it.
        import coverage
        cov = coverage.coverage(
            config_file=os.environ["COVERAGE_PROCESS_START"])
        cov.start()
        try:
            main(*args, **kwargs)
        finally:
            cov.stop()
            cov.save()

    # Swap in the wrapped entry point before the daemon starts forking.
    daemon.worker_main = _cov_wrapped

if __name__ == '__main__':
    daemon.manager()
```
More importantly, we can leave the main code intact while still allowing such
workarounds.
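For instance, a hypothetical `debug_daemon.py` following the same pattern
could address the debugging use case mentioned earlier, logging each worker's
lifecycle without touching PySpark itself (the module name and log path are
illustrative):
```python
import logging
import os

from pyspark import daemon
from pyspark.worker import main

# Configured in the daemon before forking, so workers inherit it.
logging.basicConfig(filename="/tmp/pyspark_worker.log", level=logging.DEBUG)


def _logged_main(*args, **kwargs):
    # Runs inside the forked worker process, so the PID identifies it.
    logging.debug("Python worker %d starting", os.getpid())
    try:
        main(*args, **kwargs)
    finally:
        logging.debug("Python worker %d exiting", os.getpid())


daemon.worker_main = _logged_main

if __name__ == '__main__':
    daemon.manager()
```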
## How was this patch tested?
Manually tested.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark configuration-daemon-worker
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20151.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20151
----
commit f74df4b566594152fa1efe1e3fb6033cbcf3993b
Author: hyukjinkwon <gurwls223@...>
Date: 2018-01-04T12:39:56Z
Configuration to select the modules for daemon and worker in PySpark
----