[
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622327#comment-16622327
]
Fabian Höring commented on SPARK-25433:
---------------------------------------
Actually, it turns out this can already be achieved with the current
implementation by tuning the environment variables and Spark parameters. It is
enough to generate the pex file with a generic entry point that redirects to
the custom worker.py or daemon.py module.
Will provide a working sample and then close this ticket.
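A sketch of such a generic entry point (the module layout and the environment
variable name below are illustrative assumptions, not part of PySpark):

```python
# redirect.py -- hypothetical generic entry point baked into the pex file.
# The pex runtime starts this module; it then dispatches to whichever
# PySpark module the process actually needs (worker or daemon), chosen
# via an environment variable. The variable name PYSPARK_PEX_MODULE is
# made up for this sketch.
import os
import runpy


def resolve_target():
    # Default to the worker; Spark could export e.g. "pyspark.daemon"
    # before launching the daemonized variant.
    return os.environ.get("PYSPARK_PEX_MODULE", "pyspark.worker")


def main():
    # Run the resolved module exactly as "python -m <module>" would.
    runpy.run_module(resolve_target(), run_name="__main__")
```

The pex would be built with this module as its single entry point, so one
artifact can stand in for every Python process Spark spawns.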
> Add support for PEX in PySpark
> ------------------------------
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.2.2
> Reporter: Fabian Höring
> Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the Spark
> executors using [PEX|https://github.com/pantsbuild/pex].
> This currently works fine with
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
> (the disadvantages are that you need a separate conda package repository and
> that you ship the Python interpreter every time).
> Basically the workflow is:
> * zip the local conda environment ([conda
> pack|https://github.com/conda/conda-pack] also works)
> * ship it to each executor as an archive
> * point PYSPARK_PYTHON at the unpacked conda environment on each executor
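A minimal sketch of those steps as a spark-submit invocation, built here as an
argument list (the HDFS path, the "CONDA" archive alias, and the exact conf
key for the executor Python are illustrative and can differ per deploy mode):

```python
# Illustrative sketch of the conda workflow above as a spark-submit call.
archive = "hdfs:///user/me/conda_env.zip#CONDA"  # steps 1+2: zipped env, shipped as archive
python_in_archive = "./CONDA/bin/python"         # step 3: interpreter inside the unpacked zip

cmd = [
    "spark-submit",
    "--master", "yarn",
    # YARN unpacks archives listed here into the executor's working directory.
    "--archives", archive,
    # Point the Python workers at the interpreter from the shipped environment.
    "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=" + python_in_archive,
    "my_job.py",
]
print(" ".join(cmd))
```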
> I think it can work the same way with virtualenv. There is the SPARK-13587
> ticket to provide nice entry points to spark-submit and SparkContext, but
> zipping your local virtualenv and then just changing the PYSPARK_PYTHON env
> variable should already work.
> I have also seen this
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
> But recreating the virtualenv each time doesn't seem to be a very scalable
> solution: if you have hundreds of executors, each executor will retrieve the
> packages and recreate your virtual environment every time. The same problem
> applies to proposal SPARK-16367, from what I understood.
> Another problem with virtualenv is that your local environment is not easily
> shippable to another machine. In particular there is the relocatable option
> (see
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work]),
> which makes it very complicated for the user to ship the virtualenv and be
> sure it works.
> And here is where pex comes in. It is a nice way to create a single
> executable zip file with all dependencies included. You have the pex command
> line tool to build your package, and once it is built you can be sure it
> works. This is, in my opinion, the most elegant way to ship Python code
> (better than virtualenv and conda).
> The reason it doesn't work out of the box is that a pex file can have only a
> single entry point. So just shipping the pex file and setting PYSPARK_PYTHON
> to it doesn't work. You can nevertheless tune the env variable
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
> at runtime to provide different entry points.
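For example (PEX_MODULE is a documented pex runtime variable; the pex file
name and the choice of target module here are illustrative):

```python
# Sketch: one pex file serves as the executors' Python interpreter, while
# PEX_MODULE selects which module the pex runtime runs, so the same
# artifact can act as worker or daemon without being rebuilt.
import os

env = dict(os.environ)
env["PYSPARK_PYTHON"] = "./myproject.pex"  # shipped pex used as the "interpreter"
env["PEX_MODULE"] = "pyspark.daemon"       # entry point chosen at launch time
```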
> PR: [https://github.com/apache/spark/pull/22422/files]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]