[ https://issues.apache.org/jira/browse/SPARK-22959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-22959:
---------------------------------
    Description: 
We are currently forced to use {{pyspark/daemon.py}} and {{pyspark/worker.py}} in 
PySpark tests.

This doesn't allow any custom modification of them, and it is sometimes hard to 
debug what happens in Python worker processes.

This is also related to SPARK-7721, as Coverage is unable to detect coverage in 
processes created via {{os.fork}}. With some custom fixes to force coverage 
collection, it works fine.
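
For context, coverage.py's standard subprocess hook is roughly the following (a 
sketch, not part of this proposal; it is triggered at interpreter startup by the 
{{COVERAGE_PROCESS_START}} environment variable, a step a forked worker never 
goes through):

{code}
# sitecustomize.py (or a .pth file) -- the standard coverage.py hook for
# child processes. coverage.process_startup() starts measurement only if
# COVERAGE_PROCESS_START is set in the environment.
#
# Workers spawned via os.fork() in pyspark/daemon.py inherit the parent's
# memory instead of starting a fresh interpreter, so this hook never runs
# for them and their coverage is silently lost.
import coverage

coverage.process_startup()
{code}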

This is related to SPARK-20368 as well. That JIRA describes Sentry support, 
which (roughly) needs some changes on the worker side. With this simple 
mechanism, advanced users would be able to plug in many such customizations.

As an example, suppose I configure the module {{coverage_daemon}} and have 
{{coverage_daemon.py}} on the Python path:

{code}
import os

from pyspark import daemon


if "COVERAGE_PROCESS_START" in os.environ:
    from pyspark.worker import main

    # Wrap the original worker entry point so that each forked worker
    # starts its own coverage session and saves the results on exit.
    def _cov_wrapped(*args, **kwargs):
        import coverage
        cov = coverage.coverage(
            config_file=os.environ["COVERAGE_PROCESS_START"])
        cov.start()
        try:
            main(*args, **kwargs)
        finally:
            cov.stop()
            cov.save()

    # Replace the daemon's worker entry point with the wrapped one.
    daemon.worker_main = _cov_wrapped


if __name__ == '__main__':
    daemon.manager()
{code}

This way, I can leave the main code intact while still applying such workarounds.
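
For completeness, here is a sketch of how this could be enabled from a test 
session, assuming a hypothetical configuration key such as 
{{spark.python.daemon.module}} (the exact name is open for discussion):

{code}
import os

from pyspark.sql import SparkSession

# Hypothetical: point coverage.py at a config file; coverage_daemon.py
# above checks this variable before wrapping the worker entry point.
os.environ["COVERAGE_PROCESS_START"] = "/path/to/.coveragerc"

# Hypothetical configuration key; this JIRA proposes making the daemon
# module selectable, so "coverage_daemon" replaces pyspark/daemon.py here.
spark = (SparkSession.builder
         .config("spark.python.daemon.module", "coverage_daemon")
         .getOrCreate())

spark.sparkContext.parallelize(range(10)).map(lambda x: x * 2).collect()
{code}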

  was:
We are currently forced to use {{pyspark/daemon.py}} and {{pyspark/worker.py}} in 
PySpark tests.

This doesn't allow any custom modification of them, and it is sometimes hard to 
debug what happens in Python worker processes.

This is also related to SPARK-7721, as Coverage is unable to detect coverage in 
processes created via {{os.fork}}. With some custom fixes to force coverage 
collection, it works fine.

This is related to SPARK-20368 as well. That JIRA describes Sentry support, 
which (roughly) needs some changes on the worker side. With this simple 
mechanism, advanced users would be able to plug in many such customizations.


> Configuration to select the modules for daemon and worker in PySpark
> --------------------------------------------------------------------
>
>                 Key: SPARK-22959
>                 URL: https://issues.apache.org/jira/browse/SPARK-22959
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.3.0
>            Reporter: Hyukjin Kwon
>


