Takao Magoori created SPARK-6764:
------------------------------------

             Summary: Add wheel package support for PySpark
                 Key: SPARK-6764
                 URL: https://issues.apache.org/jira/browse/SPARK-6764
             Project: Spark
          Issue Type: Improvement
          Components: Deploy, PySpark
            Reporter: Takao Magoori
            Priority: Minor


We can run _spark-submit_ with one or more Python packages (.egg, .zip and .jar) via the *--py-files* option.
h4. zip packaging
Spark puts the zip file in its working directory and adds its absolute path to Python's sys.path. When the user program imports from it, [zipimport|https://docs.python.org/2.7/library/zipimport.html] is invoked automatically under the hood. As a result, data files and dynamic modules (.pyd, .so) cannot be used, since zipimport supports only .py, .pyc and .pyo files.
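
For illustration, a minimal sketch of this limitation, assuming a hypothetical dep.zip that contains a pure-Python module (utils.py) and a compiled extension (_speedups.so):
{noformat}
import sys

# Spark effectively does this for each archive passed via --py-files.
sys.path.insert(0, '/path/to/work_dir/dep.zip')

import utils        # OK: zipimport can load .py/.pyc/.pyo from the archive
import _speedups    # ImportError: zipimport cannot load dynamic modules (.so/.pyd)
{noformat}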

h4. egg packaging
Spark puts the egg file in its working directory and adds its absolute path to Python's sys.path. Unlike zipimport, eggs can handle data files and dynamic modules, as long as the package author uses the [pkg_resources API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations] properly. However, many Python packages do not use the pkg_resources API, which leads to "ImportError" or "No such file" errors. Moreover, creating eggs for your dependencies, and for their transitive dependencies, is a troublesome job.
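
To make this concrete, here is a sketch of the difference (the data.json file name is hypothetical). The first style breaks when the package is imported straight out of a zipped egg; the second works because pkg_resources knows how to read from the archive:
{noformat}
import os
import pkg_resources

# Naive style: assumes the package lives as plain files on disk.
# Inside a zipped egg, __file__ points into the archive and open() fails
# with "No such file or directory".
path = os.path.join(os.path.dirname(__file__), 'data.json')
data = open(path).read()

# pkg_resources style: works whether the package is a directory or a zipped egg.
data = pkg_resources.resource_string(__name__, 'data.json')
{noformat}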

h4. wheel packaging
Supporting the new standard Python package format "[wheel|https://wheel.readthedocs.org/en/latest/]" would be nice. With wheels, we can run spark-submit with complex dependencies as simply as follows.

1. Write a requirements.txt file.
{noformat}
SQLAlchemy
MySQL-python
requests
simplejson>=3.6.0,<=3.6.5
pydoop
{noformat}

2. Do the wheel packaging with a single command. All dependencies are wheel-ed.
{noformat}
$ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement requirements.txt
{noformat}

3. Run spark-submit.
{noformat}
your_spark_home/bin/spark-submit --master local[4] --py-files $(find /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver.py
{noformat}
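
For illustration, once wheels are supported, your_driver.py could simply import its dependencies; a hypothetical sketch (URLs and app name are just placeholders):
{noformat}
# your_driver.py -- hypothetical example
import requests
import simplejson
from pyspark import SparkContext

sc = SparkContext(appName='wheel-example')

def fetch(url):
    # requests and simplejson resolve on the executors because their
    # wheels were distributed via --py-files.
    body = requests.get(url).text
    return simplejson.dumps({'url': url, 'length': len(body)})

urls = ['http://example.com/', 'http://example.org/']
print(sc.parallelize(urls).map(fetch).collect())

sc.stop()
{noformat}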

If your PySpark driver is itself a package consisting of many modules:

1. Write a setup.py for your PySpark driver package.
{noformat}
from setuptools import (
    find_packages,
    setup,
)

setup(
    name='yourpkg',
    version='0.0.1',
    packages=find_packages(),
    install_requires=[
        'SQLAlchemy',
        'MySQL-python',
        'requests',
        'simplejson>=3.6.0,<=3.6.5',
        'pydoop',
    ],
)
{noformat}
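
For reference, this assumes the usual setuptools project layout for your_driver_package/ (module names are hypothetical), which is what find_packages() picks up:
{noformat}
your_driver_package/
    setup.py
    yourpkg/
        __init__.py
        job.py          # assumed module containing the driver logic
{noformat}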

2. Do the wheel packaging with a single command. Your driver package and all its dependencies are wheel-ed.
{noformat}
your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/.
{noformat}

3. Run spark-submit.
{noformat}
your_spark_home/bin/spark-submit --master local[4] --py-files $(find /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver_bootstrap.py
{noformat}
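
your_driver_bootstrap.py then only needs to be a thin entry point; a hypothetical sketch (the job module and its run() function are assumed names inside yourpkg):
{noformat}
# your_driver_bootstrap.py -- hypothetical example
from pyspark import SparkContext

# yourpkg itself was built into a wheel and shipped via --py-files,
# so it can be imported like any installed package.
from yourpkg import job

if __name__ == '__main__':
    sc = SparkContext(appName='yourpkg-example')
    job.run(sc)
    sc.stop()
{noformat}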


