[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15363899#comment-15363899 ]
Semet commented on SPARK-16367:
-------------------------------

Yes, and Artifactory has an automatic mirroring capability for PyPI. It works pretty well.

> Wheelhouse Support for PySpark
> ------------------------------
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
> Issue Type: New Feature
> Components: Deploy, PySpark
> Affects Versions: 1.6.1, 1.6.2, 2.0.0
> Reporter: Semet
> Labels: newbie, python, python-wheel, wheelhouse
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> *Rationale*
> To deploy packages written in Scala, the recommended practice is to build a big fat jar file. All dependencies end up in a single package, so the only "cost" is the time it takes to copy this file to every Spark node.
> Python deployment, on the other hand, is more difficult as soon as you want to use external packages, and you don't really want to involve IT to deploy those packages into the virtualenv of each node.
> *Previous approaches*
> This proposal builds on the two following issues related to this point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge them, in order to support wheel installation and virtualenv creation.
> *Uber Fat Wheelhouse for Python Deployment*
> In Python, the packaging standard is now the "wheel", which goes further than the good old ".egg" files. With a wheel file (".whl"), the package is already prepared for a given architecture. You can have several wheels, each specific to an architecture or environment.
> {{pip}} knows how to select the wheel matching the current system and how to install it very quickly. Put differently, a package that requires compilation of a C module, for instance, does *not* compile anything when it is installed from a wheel file.
> {{pip}} also makes it easy to generate the wheels of all packages used by a given project (inside a "virtualenv"). This set of wheels is called a "wheelhouse". You can even skip the compilation entirely and retrieve pre-built wheels directly from pypi.python.org.
> *Developer workflow*
> Here is, in more concrete terms, my proposal from a PySpark developer's point of view:
> - you are writing a PySpark script that keeps growing in size and dependencies. Deploying it on Spark requires, for example, building numpy or Theano and other dependencies
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your script into a standard Python package:
> -- write a {{requirements.txt}}. I recommend pinning every package version. You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the requirements.txt:
> {code}
> astroid==1.4.6            # via pylint
> autopep8==1.2.4
> click==6.6                # via pip-tools
> colorama==0.3.7           # via pylint
> enum34==1.1.6             # via hypothesis
> findspark==1.0.0          # via spark-testing-base
> first==2.0.1              # via pip-tools
> hypothesis==3.4.0         # via spark-testing-base
> lazy-object-proxy==1.2.2  # via astroid
> linecache2==1.0.0         # via traceback2
> pbr==1.10.0
> pep8==1.7.0               # via autopep8
> pip-tools==1.6.5
> py==1.4.31                # via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2             # via spark-testing-base
> six==1.10.0               # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0         # via unittest2
> unittest2==1.1.0          # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8             # via astroid
> {code}
> -- write a {{setup.py}} with some entry points or packages. Use [PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining a {{setup.py}} file really easy.
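> For illustration, a minimal PBR-based setup could look like the following sketch. The package name {{mypackage}} and the {{run}} entry point match the launcher script shown further down; the remaining metadata values are only placeholders:
> {code}
> # setup.py -- with PBR, this file stays minimal; the metadata lives in setup.cfg
> import setuptools
>
> setuptools.setup(
>     setup_requires=['pbr'],
>     pbr=True,
> )
>
> # setup.cfg (read by PBR) would then contain something like:
> #
> # [metadata]
> # name = mypackage
> # summary = my PySpark job, packaged as a wheel
> #
> # [files]
> # packages =
> #     mypackage
> #
> # [entry_points]
> # console_scripts =
> #     mypackage-run = mypackage:run
> {code}
> With such a layout, the {{pip wheel}} step below also produces a wheel for {{mypackage}} itself, next to the wheels of its dependencies.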
> -- create a virtualenv if you are not already in one:
> {code}
> virtualenv env
> {code}
> -- work in your environment, declare the requirements you need in {{requirements.txt}}, and do all the {{pip install}} you need
> - create the wheelhouse for your current project:
> {code}
> pip install wheel
> pip wheel . --wheel-dir wheelhouse
> {code}
> This can take some time, but at the end you have all the .whl files required *for your current system*
> - zip it into a {{wheelhouse.zip}}
> Note that your own package (for instance 'my_package') is also built into a wheel and is therefore installed by {{pip}} automatically.
> Now comes the time to submit the project:
> {code}
> bin/spark-submit --master master --deploy-mode client \
>     --files /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip \
>     --conf "spark.pyspark.virtualenv.enabled=true" \
>     ~/path/to/launcher_script.py
> {code}
> You can see that:
> - no extra argument is added to the command line; all configuration goes through {{--conf}} arguments (this is taken directly from SPARK-13587). Judging from the history of the Spark source code, I guess the goal is to simplify the maintenance of the various command-line interfaces by avoiding too many specific arguments
> - the wheelhouse deployment is triggered by the {{--conf "spark.pyspark.virtualenv.enabled=true"}} argument. The {{requirements.txt}} and {{wheelhouse.zip}} are copied through {{--files}}. The names of both files can be changed through {{--conf}} arguments; I guess that with proper documentation this should not be a problem
> - you still need to give the paths to {{requirements.txt}} and {{wheelhouse.zip}} (they will be automatically copied to each node). This is important because it allows the {{pip install}} running on each node to pick only the wheels it needs. For example, if you have a package compiled for both 32-bit and 64-bit, you will have two wheels, and on each node {{pip}} will select only the right one
> - I have chosen to keep the script at the end of the command line, but for me it is just a launcher script; it can be only a few lines long:
> {code}
> #!/usr/bin/env python
> from mypackage import run
> run()
> {code}
> - on each node, a new virtualenv is created *at each deployment*. This has a cost, but not a big one, since {{pip install}} only installs wheels; no compilation and no Internet connection are required. The command line for installing the wheels on each node would look like this (a rough sketch of the whole per-node bootstrap follows this list):
> {code}
> pip install --no-index --find-links=/path/to/node/wheelhouse -r requirements.txt
> {code}
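> To make this per-node step more concrete, here is an illustrative Python sketch of what the node-side bootstrap could do. This is only a sketch, not the actual patch: the function name {{bootstrap_virtualenv}} and the temporary-directory layout are invented for the example; only {{virtualenv}} and {{pip}} themselves are real tools.
> {code}
> # Illustrative sketch only -- not the code of the proposed patch.
> # Create a throwaway virtualenv on the node, then install the shipped
> # wheelhouse into it without touching the network.
> import os
> import subprocess
> import tempfile
> import zipfile
>
>
> def bootstrap_virtualenv(wheelhouse_zip, requirements_txt):
>     """Create a fresh virtualenv and install the job's wheels into it."""
>     workdir = tempfile.mkdtemp(prefix="pyspark-venv-")
>     env_dir = os.path.join(workdir, "env")
>     wheel_dir = os.path.join(workdir, "wheelhouse")
>
>     # Unpack the wheelhouse that was shipped through --files.
>     with zipfile.ZipFile(wheelhouse_zip) as zf:
>         zf.extractall(wheel_dir)
>
>     # A new virtualenv per deployment: cheap, since nothing gets compiled.
>     subprocess.check_call(["virtualenv", env_dir])
>
>     # Offline install: pip only looks at the local wheel directory.
>     pip = os.path.join(env_dir, "bin", "pip")
>     subprocess.check_call([
>         pip, "install", "--no-index",
>         "--find-links", wheel_dir,
>         "-r", requirements_txt,
>     ])
>     return env_dir
> {code}
> The Python interpreter of the returned environment would then presumably be used to start the Python workers on that node.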
> *Advantages*
> - quick installation, since there is no compilation
> - no Internet connectivity required: no need to mess with the corporate proxy or to maintain a local mirror of PyPI
> - package version isolation (two Spark jobs can depend on two different versions of a given library)
> *Disadvantages*
> - creating a virtualenv at each execution takes time; not much, but it can still take a few seconds
> - ...and disk space
> - slightly more complex to set up than sending a simple Python script, but that simple workflow is not lost
> - support for heterogeneous Spark nodes (e.g. 32-bit and 64-bit) is possible, but one has to ship all wheel flavours and ensure pip can install in every environment. The complexity of this task is in the hands of the developer and no longer of the IT staff! (IMHO, this is an advantage)
> *Code submission*
> I have already started working on this, beginning by merging the two merge requests [#5408|https://github.com/apache/spark/pull/5408] and [#13599|https://github.com/apache/spark/pull/13599].
> I'll upload a patch asap for review.
> I see two major open questions:
> - I don't know YARN or Mesos that well, so I might need some help for the final integration
> - the documentation should really be carefully crafted so that users are not lost in all these concepts
> I really think that having this "wheelhouse" support will help with using, maintaining, and evolving Python scripts on Spark. Python has a rich set of mature libraries, and Spark should do everything it can to help developers easily access and use them in their everyday job.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org