Hi, I have been trying to figure out how to ship a Python package that I have been working on, and this has raised a couple of questions for me. Please note that I'm fairly new to Python package management, so any feedback/corrections are welcome =)
It looks like the --py-files support we have merely adds the .py, .zip, or .egg files to sys.path, and therefore only supports "built" distributions that just need to be added to the path. Because of this, it looks like wheels won't work either, since they involve an installation step (https://www.python.org/dev/peps/pep-0427/#is-it-possible-to-import-python-code-directly-from-a-wheel-file). In addition, any distribution that contains shared libraries, such as the pandas and numpy wheels, will fail because "ZIP import of dynamic modules (.pyd, .so) is disallowed" (https://docs.python.org/2/library/zipimport.html).

The only way to support wheels, or other distributions that require an "installation" step, is to use an installer like pip, in which case the natural extension is to use virtualenv. Have we considered having pyspark manage virtualenvs and use pip install to install the packages that are sent across the cluster? (There is a very rough sketch of what that could look like in the P.S. below.) I feel like first-class support for pip install would:

- allow us to ship packages that require an install step (numpy, pandas, etc.)
- help users avoid having to provision the cluster with all the dependencies up front
- allow multiple applications to run with different environments at the same time
- allow a user to specify just a top-level dependency or a requirements.txt, and have pip install all the transitive dependencies automatically

Thanks!
Justin
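
P.S. To make the idea a bit more concrete, here is a very rough, purely illustrative sketch of what a per-application bootstrap on the workers might look like. None of this is existing pyspark API; the function name is made up, and it assumes the virtualenv command and access to a package index (or a local wheel cache) are available on every worker node:

    # Purely illustrative sketch -- not an existing PySpark API. Assumes the
    # 'virtualenv' command and a reachable package index (or local wheel
    # cache) exist on each worker node.
    import os
    import subprocess
    import tempfile

    def bootstrap_env(requirements_path):
        """Create a throwaway virtualenv and pip-install the requirements.

        Returns the path to the environment's python interpreter, which the
        worker could then use to run user code with all transitive
        dependencies (numpy, pandas, ...) installed.
        """
        env_dir = tempfile.mkdtemp(prefix="pyspark-env-")
        # Create an isolated environment on the worker.
        subprocess.check_call(["virtualenv", env_dir])
        pip = os.path.join(env_dir, "bin", "pip")
        # Let pip resolve and install transitive dependencies, including
        # wheels with compiled extensions that zipimport cannot handle.
        subprocess.check_call([pip, "install", "-r", requirements_path])
        return os.path.join(env_dir, "bin", "python")

On the driver side, spark-submit could ship the requirements.txt along with the job and run something like this once per application on each worker, instead of shipping the packages themselves via --py-files.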