[ https://issues.apache.org/jira/browse/SPARK-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362015#comment-15362015 ]
Takao Magoori commented on SPARK-6764:
--------------------------------------

Sorry. It seems there is no isolated site-packages directory; the work directory is just added to sys.path.

> Add wheel package support for PySpark
> -------------------------------------
>
>                 Key: SPARK-6764
>                 URL: https://issues.apache.org/jira/browse/SPARK-6764
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy, PySpark
>            Reporter: Takao Magoori
>            Priority: Minor
>              Labels: newbie
>
> We can do _spark-submit_ with one or more Python packages (.egg, .zip and .jar) via the *--py-files* option.
> h4. zip packaging
> Spark puts the zip file in its working directory and adds the absolute path to Python's sys.path. When the user program imports it, [zipimport|https://docs.python.org/2.7/library/zipimport.html] is invoked automatically under the hood. That is, data files and dynamic modules (.pyd, .so) cannot be used, since zipimport supports only .py, .pyc and .pyo files.
> h4. egg packaging
> Spark puts the egg file in its working directory and adds the absolute path to Python's sys.path. Unlike plain zipimport, an egg can handle data files and dynamic modules, as long as the package author uses the [pkg_resources API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations] properly. But many Python packages do not use the pkg_resources API, which causes "ImportError" or "No such file" errors. Moreover, creating eggs for dependencies, and for their transitive dependencies, is a troublesome job.
> h4. wheel packaging
> Supporting the new standard Python package format, [wheel|https://wheel.readthedocs.org/en/latest/], would be nice. With wheel, we can do spark-submit with complex dependencies as simply as follows.
> 1. Write a requirements.txt file.
> {noformat}
> SQLAlchemy
> MySQL-python
> requests
> simplejson>=3.6.0,<=3.6.5
> pydoop
> {noformat}
> 2. Do the wheel packaging with a single command. All dependencies are wheel-ed.
> {noformat}
> $ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement requirements.txt
> {noformat}
> 3. Do spark-submit.
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver.py
> {noformat}
> If your PySpark driver is a package which consists of many modules:
> 1. Write a setup.py for your driver package.
> {noformat}
> from setuptools import (
>     find_packages,
>     setup,
> )
>
> setup(
>     name='yourpkg',
>     version='0.0.1',
>     packages=find_packages(),
>     install_requires=[
>         'SQLAlchemy',
>         'MySQL-python',
>         'requests',
>         'simplejson>=3.6.0,<=3.6.5',
>         'pydoop',
>     ],
> )
> {noformat}
> 2. Do the wheel packaging with a single command. Your driver package and all its dependencies are wheel-ed.
> {noformat}
> your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/.
> {noformat}
> 3. Do spark-submit.
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver_bootstrap.py
> {noformat}
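To illustrate the egg caveat above: a package that builds file paths from __file__ and reads them with plain open() breaks when imported from a zipped egg, while the pkg_resources API works in both layouts. A minimal sketch; the package name "yourpkg" and the resource path "data/defaults.json" are hypothetical placeholders, not anything from this issue.

{noformat}
import pkg_resources

# pkg_resources works whether yourpkg is installed flat on disk or imported
# from a zipped egg; zipped resources are extracted to a cache as needed.
data = pkg_resources.resource_string("yourpkg", "data/defaults.json")

# A path built from __file__ instead points *inside* the egg zip when the
# package is zip-imported, so plain open() fails with "No such file":
#
#   import os, yourpkg
#   path = os.path.join(os.path.dirname(yourpkg.__file__), "data/defaults.json")
#   open(path)  # IOError under a zipped egg
{noformat}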
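Regarding the comment above that there is no isolated site-packages directory: one possible workaround is for your_driver_bootstrap.py to pip-install the shipped wheels into a private target directory and prepend that directory to sys.path before importing anything else. This is a minimal sketch, not existing Spark behavior; it assumes pip is available to the worker's Python and that the wheels shipped via --py-files end up in the process's current working directory, as the comment suggests the work directory is.

{noformat}
# your_driver_bootstrap.py -- hypothetical bootstrap sketch
import glob
import os
import subprocess
import sys
import tempfile

def install_shipped_wheels():
    # Assumption: wheels shipped via --py-files land in the work directory,
    # and the work directory is the current working directory.
    wheelhouse = os.getcwd()
    wheels = glob.glob(os.path.join(wheelhouse, "*.whl"))
    if not wheels:
        return
    # Install into a private target directory; --no-index keeps pip offline
    # and --find-links resolves all dependencies from the shipped wheels.
    target = tempfile.mkdtemp(prefix="wheel-site-")
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "--no-index",
         "--find-links", wheelhouse, "--target", target] + wheels)
    # Prepend so the installed packages win over the bare .whl entries
    # that Spark put on sys.path.
    sys.path.insert(0, target)

install_shipped_wheels()
{noformat}

After this runs, "import requests; print(requests.__file__)" should point into the wheel-site-* target directory, and compiled extension modules inside the wheels become importable, which plain zipimport cannot offer.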