[
https://issues.apache.org/jira/browse/SPARK-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361408#comment-15361408
]
Semet edited comment on SPARK-6764 at 7/5/16 9:44 AM:
------------------------------------------------------
Hello
I am working on a new proposal for complete wheel support, along with
virtualenv. I think this will solve many dependency problems with Python
packages.
Full proposal is here: https://issues.apache.org/jira/browse/SPARK-16367
> Add wheel package support for PySpark
> -------------------------------------
>
> Key: SPARK-6764
> URL: https://issues.apache.org/jira/browse/SPARK-6764
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, PySpark
> Reporter: Takao Magoori
> Priority: Minor
> Labels: newbie
>
> We can do _spark-submit_ with one or more Python packages (.egg, .zip and
> .jar) via the *--py-files* option.
> h4. zip packaging
> Spark puts the zip file in its working directory and adds its absolute path
> to Python's sys.path. When the user program imports from it,
> [zipimport|https://docs.python.org/2.7/library/zipimport.html] is
> automatically invoked under the hood. That is, data files and dynamic
> modules (.pyd, .so) can not be used, since zipimport supports only .py,
> .pyc and .pyo.
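> For example, a module like the following (a hypothetical mypkg, purely for
> illustration) imports fine from a zip but fails at runtime as soon as it
> tries to read a bundled data file:
> {noformat}
> # mypkg/config.py -- hypothetical module shipped inside a --py-files zip.
> # zipimport imports this .py file happily, but open() below fails because
> # the computed "path" points inside the zip archive, not at a real file.
> import os
>
> _SCHEMA = os.path.join(os.path.dirname(__file__), "schema.json")
>
> def load_schema():
>     # Raises IOError ("No such file or directory") when __file__ is
>     # something like /tmp/spark-xxxx/mypkg.zip/mypkg/config.py
>     with open(_SCHEMA) as f:
>         return f.read()
> {noformat}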
> h4. egg packaging
> Spark puts the egg file in its working directory and adds its absolute path
> to Python's sys.path. Unlike zipimport, an egg can handle data files and
> dynamic modules, as long as the author of the package uses the [pkg_resources
> API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations]
> properly. But many Python packages do not use the pkg_resources API, which
> causes "ImportError" or "No such file" errors. Moreover, creating eggs of
> your dependencies, and of their dependencies in turn, is a troublesome job.
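> For comparison, the egg-safe version of the hypothetical module above has to
> go through pkg_resources instead of open():
> {noformat}
> # mypkg/config.py -- egg-safe variant of the hypothetical module above.
> # pkg_resources locates (and, if needed, extracts) the resource, so this
> # works from an egg, a zip archive, or a plain directory alike.
> import pkg_resources
>
> def load_schema():
>     return pkg_resources.resource_string("mypkg", "schema.json")
> {noformat}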
> h4. wheel packaging
> Supporting the new standard Python package format,
> "[wheel|https://wheel.readthedocs.org/en/latest/]", would be nice. With
> wheels, we can do spark-submit with complex dependencies as simply as
> follows.
> 1. Write requirements.txt file.
> {noformat}
> SQLAlchemy
> MySQL-python
> requests
> simplejson>=3.6.0,<=3.6.5
> pydoop
> {noformat}
> 2. Do wheel packaging with a single command. All dependencies are wheel-ed.
> {noformat}
> $ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse \
>       --requirement requirements.txt
> {noformat}
> 3. Do spark-submit
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] \
>   --py-files $(find /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') \
>   your_driver.py
> {noformat}
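> Under this proposal, your_driver.py can then import its dependencies as
> usual. A minimal sketch (the JSON-parsing job is purely illustrative):
> {noformat}
> # your_driver.py -- minimal sketch; assumes the wheels built from the
> # requirements.txt above were shipped via --py-files.
> from pyspark import SparkContext
>
> import simplejson  # resolved from one of the shipped wheels
>
> def parse(line):
>     # importing inside the function shows the wheel is importable on
>     # the executors as well, not only on the driver
>     import simplejson
>     return simplejson.loads(line)
>
> if __name__ == "__main__":
>     sc = SparkContext(appName="wheel-demo")
>     lines = sc.parallelize(['{"a": 1}', '{"b": 2}'])
>     print(lines.map(parse).collect())
>     sc.stop()
> {noformat}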
> If your pyspark driver is a package which consists of many modules:
> 1. Write a setup.py for your pyspark driver package.
> {noformat}
> from setuptools import (
>     find_packages,
>     setup,
> )
>
> setup(
>     name='yourpkg',
>     version='0.0.1',
>     packages=find_packages(),
>     install_requires=[
>         'SQLAlchemy',
>         'MySQL-python',
>         'requests',
>         'simplejson>=3.6.0,<=3.6.5',
>         'pydoop',
>     ],
> )
> {noformat}
> 2. Do wheel packaging with a single command. Your driver package and all
> dependencies are wheel-ed.
> {noformat}
> your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/.
> {noformat}
> 3. Do spark-submit
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] \
>   --py-files $(find /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') \
>   your_driver_bootstrap.py
> {noformat}
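> The bootstrap script itself can stay trivial, since all the real logic lives
> in the wheel-ed package. A sketch (yourpkg.main is a hypothetical entry
> module, matching the setup.py above):
> {noformat}
> # your_driver_bootstrap.py -- thin entry point; the driver package itself
> # ("yourpkg") arrives on sys.path as one of the --py-files wheels.
> from yourpkg.main import main  # hypothetical entry module inside yourpkg
>
> if __name__ == "__main__":
>     main()
> {noformat}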