Nicholas Chammas commented on SPARK-13587:


Previously, I have had reasonable success with zipping the contents of my conda 
environment on the gateway/driver node and submitting the zip file as an 
argument to --archives on the spark-submit command line. This approach works 
perfectly because it uses the existing Spark infrastructure to distribute 
dependencies to the workers. You don't even need Anaconda installed on the 
workers, since the zip can package the entire Python installation within it. 
The downside is that conda zip files can bloat up quickly in a production 
Spark application.
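
For reference, here is my reading of that approach as a small Python sketch. Everything in it is an assumption on my part rather than the commenter's actual method: it assumes YARN in cluster mode, a POSIX layout with the interpreter at bin/python inside the environment, and the hypothetical alias "environment" for the unpacked archive. The helper names zip_environment and build_submit_command are made up for illustration.

```python
import shutil
from pathlib import Path


def zip_environment(env_dir, out_base):
    """Zip the contents of env_dir into <out_base>.zip and return the archive path."""
    # root_dir makes paths inside the zip relative to the environment root,
    # so bin/python ends up at the top level of the archive.
    return shutil.make_archive(str(out_base), "zip", root_dir=str(env_dir))


def build_submit_command(archive, app, alias="environment"):
    """Build a spark-submit argv that ships the archive via --archives (YARN assumed)."""
    # Assumed location of the interpreter once the archive is unpacked
    # into the container's working directory under the given alias.
    python = f"./{alias}/bin/python"
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # "<file>#<alias>" asks YARN to unpack the archive into a directory named <alias>.
        "--archives", f"{archive}#{alias}",
        # Point both the driver (application master) and the executors at that interpreter.
        "--conf", f"spark.yarn.appMasterEnv.PYSPARK_PYTHON={python}",
        "--conf", f"spark.executorEnv.PYSPARK_PYTHON={python}",
        str(app),
    ]
```

The returned list could be handed to subprocess.run; I have not verified this end to end on a cluster, so treat it as a starting point only.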

Can you elaborate on how you did this? I'm willing to jump through some hoops 
to create a hackish way of distributing dependencies while this JIRA task gets 
worked out.

What I'm trying is:
1. Create a virtual environment and activate it.
2. Pip install my requirements into that environment, as one would in a regular 
   Python project.
3. Zip up the venv/ folder and ship it with my application using --py-files.
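
A rough Python sketch of those three steps, under a few assumptions of mine. It assumes a POSIX venv layout (bin/python), and the helper names are hypothetical. It also deviates from step 3 in one respect: it zips the contents of the environment's site-packages directory rather than the whole venv/ folder, because Python imports from a zip on sys.path relative to the archive root, so packages buried under venv/lib/... inside the zip would not be found. Note too that zipimport cannot load compiled extension modules, so this can only work for pure-Python dependencies.

```python
import subprocess
import venv
import zipfile
from pathlib import Path


def create_env(env_dir, requirements):
    """Steps 1 and 2: create a virtual environment and pip install into it."""
    venv.EnvBuilder(with_pip=True).create(env_dir)
    python = Path(env_dir) / "bin" / "python"  # POSIX layout assumed
    subprocess.check_call([str(python), "-m", "pip", "install", "-r", str(requirements)])
    return python


def site_packages(python):
    """Ask the environment's own interpreter where its site-packages directory is."""
    out = subprocess.check_output(
        [str(python), "-c", "import sysconfig; print(sysconfig.get_paths()['purelib'])"],
        text=True,
    )
    return Path(out.strip())


def zip_packages(src_dir, zip_path):
    """Step 3: zip src_dir so that its packages sit at the archive root."""
    src_dir = Path(src_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Store each file relative to src_dir, not to the venv root.
                zf.write(path, path.relative_to(src_dir).as_posix())
    return Path(zip_path)


def submit_command(zip_path, app):
    """Build the spark-submit argv that ships the zip with --py-files."""
    return ["spark-submit", "--py-files", str(zip_path), str(app)]
```

Chained together, this would be create_env, then site_packages, then zip_packages, then running the list from submit_command. Whether the shipped packages win over the workers' system site-packages depends on the resulting sys.path order, which this sketch does not control.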

I'm struggling to get the workers to pick up Python dependencies from the 
packaged venv over what's in the system site-packages. All I want is to be able 
to ship out the dependencies with the application from a virtual environment 
all at once (i.e. without having to enumerate each dependency).

Has anyone been able to do this today? It would be good to document it as a 
workaround for people until this issue is resolved.

> Support virtualenv in PySpark
> -----------------------------
>                 Key: SPARK-13587
>                 URL: https://issues.apache.org/jira/browse/SPARK-13587
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>            Reporter: Jeff Zhang
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies).
> * Another way is to install packages manually on each node (time-consuming, 
> and it makes switching between environments hard).
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA aims to bring these 2 
> tools to a distributed environment.

