Hi, I have been trying to figure out how to ship a Python package that I have been working on, and this has raised a couple of questions for me. Please note that I'm fairly new to Python package management, so any feedback/corrections are welcome =)
It looks like the --py-files support we have merely adds the .py, .zip, or .egg files to sys.path, and therefore only supports "built" distributions that just need to be added to the path. Because of this, it looks like wheels won't work either, since they involve an installation step (https://www.python.org/dev/peps/pep-0427/#is-it-possible-to-import-python-code-directly-from-a-wheel-file). In addition, any distribution that contains shared libraries, such as the pandas and numpy wheels, will fail because "ZIP import of dynamic modules (.pyd, .so) is disallowed" (https://docs.python.org/2/library/zipimport.html).

The only way to support wheels, or other distributions that require an "installation" step, is to use an installer like pip, in which case the natural extension is to use virtualenv. Have we considered having pyspark manage virtualenvs and use pip install to install the packages that are sent across the cluster? (There is a very rough sketch of what that could look like in the P.S. below.) I feel like first-class support for pip install would:

- allow us to ship packages that require an install step (numpy, pandas, etc.)
- help users avoid having to provision the cluster with all the dependencies up front
- allow multiple applications to run with different environments at the same time
- allow a user to specify just a top-level dependency or a requirements.txt, and have pip install all the transitive dependencies automatically

Thanks!
Justin
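
P.S. To make the idea a bit more concrete, here is a very rough, purely illustrative sketch of what a per-application bootstrap on the workers might look like. None of this is existing pyspark API; the function name is made up, and it assumes the virtualenv command and access to a package index (or a local wheel cache) are available on every worker node:

    # Purely illustrative sketch -- not an existing PySpark API. Assumes the
    # 'virtualenv' command and a reachable package index (or local wheel
    # cache) exist on each worker node.
    import os
    import subprocess
    import tempfile

    def bootstrap_env(requirements_path):
        """Create a throwaway virtualenv and pip-install the requirements.

        Returns the path to the environment's python interpreter, which the
        worker could then use to run user code with all transitive
        dependencies (numpy, pandas, ...) installed.
        """
        env_dir = tempfile.mkdtemp(prefix="pyspark-env-")
        # Create an isolated environment on the worker.
        subprocess.check_call(["virtualenv", env_dir])
        pip = os.path.join(env_dir, "bin", "pip")
        # Let pip resolve and install transitive dependencies, including
        # wheels with compiled extensions that zipimport cannot handle.
        subprocess.check_call([pip, "install", "-r", requirements_path])
        return os.path.join(env_dir, "bin", "python")

On the driver side, spark-submit could ship the requirements.txt along with the job and run something like this once per application on each worker, instead of shipping the packages themselves via --py-files.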