[
https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16617399#comment-16617399
]
Fabian Höring commented on SPARK-25433:
---------------------------------------
[~hyukjin.kwon] I changed the description of the ticket, including links to
existing attempts.
> Add support for PEX in PySpark
> ------------------------------
>
> Key: SPARK-25433
> URL: https://issues.apache.org/jira/browse/SPARK-25433
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.2.2
> Reporter: Fabian Höring
> Priority: Minor
>
> The goal of this ticket is to ship and use custom code inside the Spark
> executors.
> This currently works fine with
> [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html]
> (the disadvantages are that you need a separate conda package repository and
> ship the Python interpreter every time).
> Basically the workflow is
> * to zip the local conda environment ([conda
> pack|https://github.com/conda/conda-pack] also works)
> * ship it to each executor as an archive
> * point PYSPARK_PYTHON at the interpreter inside the shipped conda environment
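The workflow above might look like this in practice (a sketch only; the archive
name, environment name, and YARN settings are illustrative, not from the ticket):

```shell
# Pack the local conda environment into a single archive.
# (conda pack relocates the interpreter and all installed packages.)
conda pack -n my_env -o my_env.tar.gz

# Ship the archive to each executor; the "#environment" suffix tells YARN
# to unpack it into a directory named "environment" in the container.
# Then point PYSPARK_PYTHON at the interpreter inside that directory.
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
  --master yarn \
  --archives my_env.tar.gz#environment \
  my_job.py
```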
> I think it can work the same way with virtualenv. There is the SPARK-13587
> ticket to provide nice entry points in spark-submit and SparkContext, but
> zipping your local virtualenv and then just changing PYSPARK_PYTHON
> should already work.
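A sketch of the virtualenv variant of the same idea (paths and package names
are illustrative; as discussed below, zipped virtualenvs are not reliably
relocatable, which is exactly what makes this approach fragile):

```shell
# Create a local virtual environment and install the job's dependencies.
python -m venv my_venv
./my_venv/bin/pip install numpy

# Zip the environment and ship it to the executors as an archive,
# pointing PYSPARK_PYTHON at the interpreter inside the unpacked zip.
(cd my_venv && zip -qr ../my_venv.zip .)
PYSPARK_PYTHON=./venv/bin/python \
spark-submit \
  --master yarn \
  --archives my_venv.zip#venv \
  my_job.py
```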
> I also have seen this
> [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html].
> But recreating the virtual env each time doesn't seem to be a very scalable
> solution. If you have hundreds of executors, each executor will retrieve the
> packages and recreate the virtual environment every time. Same problem
> with the proposal in SPARK-16367, from what I understood.
> Another problem with virtualenv is that your local environment is not easily
> shippable to another machine. In particular, the relocatable option (see
> [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable],
> [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work])
> is unreliable, which makes it very complicated for the user to ship the
> virtual env and be sure it works.
> And here is where [pex|https://github.com/pantsbuild/pex] comes in. It is a
> nice way to create a single executable zip file with all dependencies
> included. You have the pex command line tool to build your package, and once
> it is built you are sure it works. This is in my opinion the most elegant way
> to ship Python code (better than virtualenv and conda).
> The reason it doesn't work out of the box is that a pex file has only a
> single entry point. So just shipping the pex file and setting PYSPARK_PYTHON
> to it doesn't work. You can nevertheless set the
> [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables]
> environment variable at runtime to provide a different entry point.
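A sketch of the pex variant (the package list, file names, and the exact
PEX_MODULE value are illustrative assumptions, not something this ticket
prescribes; the point is only that the entry point is overridden via an
environment variable rather than baked into the pex file):

```shell
# Build a single self-contained executable zip with all dependencies.
pex numpy pandas -o my_env.pex

# Ship the pex file to the executors and use it as the Python interpreter.
# Because a pex file has one built-in entry point, PEX_MODULE is set on the
# executors so the pex runs the PySpark worker module instead.
PYSPARK_PYTHON=./my_env.pex \
spark-submit \
  --master yarn \
  --files my_env.pex \
  --conf spark.executorEnv.PEX_MODULE=pyspark.daemon \
  my_job.py
```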
> PR: [https://github.com/apache/spark/pull/22422/files]
>
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]