[
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15364972#comment-15364972
]
Jeff Zhang commented on SPARK-16367:
------------------------------------
[[email protected]] I still don't understand how the binary wheels work on
machines with different OSes, since you build the wheels on the client machine.
> Wheelhouse Support for PySpark
> ------------------------------
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
> Issue Type: New Feature
> Components: Deploy, PySpark
> Affects Versions: 1.6.1, 1.6.2, 2.0.0
> Reporter: Semet
> Labels: newbie, python, python-wheel, wheelhouse
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> *Rationale*
> To deploy packages written in Scala, the recommended approach is to build
> big fat jar files. This bundles all dependencies into one package, so the
> only "cost" is the copy time to deploy this file on every Spark node.
> On the other hand, Python deployment is more difficult once you want to use
> external packages, and you don't really want to bother IT with deploying the
> packages into the virtualenv of each node.
> *Previous approaches*
> I based the current proposal on the two following issues related to this
> point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge them, in order to support both
> wheel installation and virtualenv creation.
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark*
> In Python, the packaging standard is now the "wheel" file format, which goes
> further than the good old ".egg" files. With a wheel file (".whl"), the
> package is already prepared for a given architecture. You can have several
> wheels for a given package version, each specific to an architecture or
> environment. For example, look at https://pypi.python.org/pypi/numpy to see
> all the different wheels available for one version.
> The {{pip}} tool knows how to select the right wheel file matching the
> current system, and how to install the package at light speed (without
> compilation). Said otherwise, a package that requires compilation of a C
> module, for instance "numpy", does *not* compile anything when installed
> from a wheel file.
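> As an illustration only (not part of this proposal), here is a small sketch
> that lists the platform/ABI tags pip accepts on the current machine; pip
> keeps the wheel whose tags match one of them. It assumes the third-party
> {{packaging}} library (version 19.3 or newer) is available:
> {code}
> # Illustration: list the wheel tags accepted on this machine.
> # Assumption: the "packaging" library (>= 19.3) is installed.
> from packaging.tags import sys_tags
>
> for tag in list(sys_tags())[:5]:
>     # e.g. cp27-cp27mu-manylinux1_x86_64 on 64-bit Linux with CPython 2.7
>     print(tag)
> {code}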
> {{pypi.python.org}} already provides wheels for the major Python versions.
> If a wheel is not available, pip will compile the package from source
> anyway. Mirroring of PyPI is possible through projects such as
> http://doc.devpi.net/latest/ (untested) or the PyPI mirror support in
> Artifactory (tested personally).
> {{pip}} also provides the ability to easily generate the wheels of all the
> packages used by a given project inside a "virtualenv". This is called a
> "wheelhouse". You can even skip the compilation entirely and retrieve the
> wheels directly from pypi.python.org.
> *Use Case 1: no internet connectivity*
> Here is my first proposal for a deployment workflow, for the case where the
> Spark cluster does not have any Internet connectivity or access to a PyPI
> mirror. In this case the simplest way to deploy a project with several
> dependencies is to build and then send the complete "wheelhouse":
> - you are writing a PySpark script that keeps growing in size and
> dependencies. Deploying it on Spark requires, for example, building numpy or
> Theano and other dependencies
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your
> script into a standard Python package:
> -- write a {{requirements.txt}}. I recommend pinning all package versions.
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the
> requirements.txt
> {code}
> astroid==1.4.6 # via pylint
> autopep8==1.2.4
> click==6.6 # via pip-tools
> colorama==0.3.7 # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0 # via spark-testing-base
> first==2.0.1 # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2 # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0 # via autopep8
> pip-tools==1.6.5
> py==1.4.31 # via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0 # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8 # via astroid
> {code}
> -- write a {{setup.py}} with some entry points or packages. Use
> [PBR|http://docs.openstack.org/developer/pbr/], it makes the job of
> maintaining a {{setup.py}} file really easy (see the sketch below)
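> For instance, with PBR the {{setup.py}} shrinks to a few lines, the actual
> metadata living in a {{setup.cfg}} next to it (the package name used here is
> just an example):
> {code}
> # Minimal setup.py when using PBR; name, version, entry points, etc.
> # are declared in setup.cfg (e.g. [metadata] name = my_package).
> from setuptools import setup
>
> setup(
>     setup_requires=['pbr'],
>     pbr=True,
> )
> {code}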
> -- create a virtualenv if not already in one:
> {code}
> virtualenv env
> {code}
> -- work on your environment, define the requirements you need in
> {{requirements.txt}}, do all the {{pip install}} you need.
> - create the wheelhouse for your current project
> {code}
> pip install wheel
> pip wheel . --wheel-dir wheelhouse
> {code}
> This can take some time, but at the end you have all the .whl files required
> *for your current system* in a {{wheelhouse}} directory.
> - zip it into a {{wheelhouse.zip}}.
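> A one-liner is enough for this step; for example, in Python (the directory
> and archive names are just the ones used above):
> {code}
> # Pack the wheelhouse/ directory produced by "pip wheel" into wheelhouse.zip
> import shutil
>
> shutil.make_archive('wheelhouse', 'zip', 'wheelhouse')
> {code}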
> Note that your own package (for instance 'my_package') can also be built as
> a wheel and thus installed by {{pip}} automatically.
> Now comes the time to submit the project:
> {code}
> bin/spark-submit --master master --deploy-mode client --files
> /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip
> --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
> {code}
> You can see that:
> - no extra argument is added to the command line. All configuration goes
> through the {{--conf}} argument (this has been taken directly from
> SPARK-13587). Judging from the history of the Spark source code, I guess the
> goal is to simplify the maintenance of the various command line interfaces
> by avoiding too many specific arguments.
> - the wheelhouse deployment is triggered by the {{\-\-conf
> "spark.pyspark.virtualenv.enabled=true"}} argument. The {{requirements.txt}}
> and {{wheelhouse.zip}} are copied through {{--files}}. The names of both
> files can be changed through {{\-\-conf}} arguments. I guess that with
> proper documentation this should not be a problem
> - you still need to give the paths to {{requirements.txt}} and
> {{wheelhouse.zip}} (they will be automatically copied to each node). This is
> important since it allows {{pip install}}, running on each node, to pick
> only the wheels it needs. For example, if you have a package compiled for 32
> bits and for 64 bits, you will have 2 wheels, and on each node {{pip}} will
> only select the right one
> - I have chosen to keep the script at the end of the command line, but for
> me it is just a launcher script; it can be as short as 4 lines:
> {code}
> #!/usr/bin/env python
> from mypackage import run
> run()
> {code}
> - on each node, a new virtualenv is created *at each deployment*. This has a
> cost, but not a big one, since {{pip install}} will only install wheels; no
> compilation and no Internet connection will be required. The command line
> for installing the wheels on each node will look like:
> {code}
> pip install --no-index --find-links=/path/to/node/wheelhouse -r
> requirements.txt
> {code}
> *advantages*
> - quick installation, since there is no compilation
> - no Internet connectivity required, no need to mess with the corporate
> proxy or to maintain a local mirror of PyPI
> - package version isolation (two Spark jobs can depend on two different
> versions of a given library)
> *disadvantages*
> - creating a virtualenv at each execution takes time, not that much, but it
> can still take a few seconds
> - and disk space
> - slightly more complex to set up than sending a simple Python script, but
> that possibility is not lost
> - support of heterogeneous Spark nodes (e.g. 32 bits, 64 bits) is possible,
> but one has to send all wheel flavours and ensure pip is able to install in
> every environment. The complexity of this task is in the hands of the
> developer and no longer of the IT persons! (IMHO, this is an advantage)
> *Use Case 2: the Spark cluster has access to PyPI or a mirror of PyPI*
> This is the more elegant situation. The Spark cluster (each node) can
> install the dependencies of your project on its own, from the wheels
> provided by PyPI. Your internal dependencies and your job project can also
> come as independent wheel files. In this case the workflow is much simpler:
> - turn your project into a Python module
> - write {{requirements.txt}} and {{setup.py}} like in Use Case 1
> - create the wheels with {{pip wheel}}. But now we will not send *ALL* the
> dependencies, only the ones that are not on PyPI (the current job project,
> other internal dependencies, etc.)
> - no need to create a wheelhouse. You can still copy the wheels either with
> {{--py-files}} (they will be automatically installed) or inside a wheelhouse
> named {{wheelhouse.zip}}
> Deployment then becomes:
> {code}
> bin/spark-submit --master master --deploy-mode client --files
> /path/to/project/requirements.txt --py-files
> /path/to/project/internal_dependency_1.whl,/path/to/project/internal_dependency_2.whl,/path/to/project/current_project.whl
> --conf "spark.pyspark.virtualenv.enabled=true" --conf
> "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"
> ~/path/to/launcher_script.py
> {code}
> or with a wheelhouse that only contains internal dependencies and current
> project wheels:
> {code}
> bin/spark-submit --master master --deploy-mode client --files
> /path/to/project/requirements.txt,/path/to/project/wheelhouse.zip --conf
> "spark.pyspark.virtualenv.enabled=true" --conf
> "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"
> ~/path/to/launcher_script.py
> {code}
> or, if you want to use the official PyPI or have configured {{pip.conf}} to
> hit the internal PyPI mirror (see documentation below):
> {code}
> bin/spark-submit --master master --deploy-mode client --files
> /path/to/project/requirements.txt,/path/to/project/wheelhouse.zip --conf
> "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
> {code}
> On each node, the deployment will be done with a command such as:
> {code}
> pip install --index-url http://pypi.mycompany.com
> --find-links=/path/to/node/wheelhouse -r requirements.txt
> {code}
> Note:
> - {{\-\-conf
> "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"}} allows
> specifying a PyPI mirror, for example a mirror internal to your company
> network. If not provided, the default PyPI index (pypi.python.org) will be
> used
> - to send a wheelhouse, use {{\-\-files}}. To send individual wheels, use
> {{\-\-py-files}}. With the latter, all wheels will be installed. For a
> multiple-architecture cluster, prepare all needed wheels for all
> architectures and use a wheelhouse archive; this allows {{pip}} to choose
> the right version of each wheel automatically.
> *code submission*
> I have already started working on this point, starting by merging the 2
> merge requests [#5408|https://github.com/apache/spark/pull/5408] and
> [#13599|https://github.com/apache/spark/pull/13599].
> I'll upload a patch ASAP for review.
> I see two major open questions:
> - I don't know YARN or Mesos that well, so I might require some help for the
> final integration
> - documentation should really be carefully crafted so users are not lost in
> all these concepts
> I really think having this "wheelhouse" support in Spark will help a lot
> with using, maintaining, and evolving Python scripts on Spark. Python has a
> rich set of mature libraries and Spark should do everything it can to help
> developers easily access and use them in their everyday job.
> *Important notes about complex packages such as numpy*
> Numpy is the kind of package that takes several minutes to build, and we
> want to avoid having all nodes compile it at each deployment. PyPI provides
> several precompiled wheels, but it may happen that the wheels are not right
> for your platform or for the platform of your cluster.
> Wheels are *not* cached for pip versions < 7.0. From pip 7.0 onwards, wheels
> are automatically cached when built (if needed), so the first installation
> might take some time, but afterwards installation is straightforward.
> On most of my machines, numpy is installed without any compilation thanks to
> wheels.
> *Certificate*
> pip does not use the system SSL certificates. If you use a local PyPI mirror
> behind HTTPS with an internal certificate, you'll have to configure pip with
> the following content in {{~/.pip/pip.conf}}:
> {code}
> [global]
> cert = /path/to/your/internal/certificates.pem
> {code}
> The first build might take some time, but pip will automatically cache the
> wheel for your system in {{~/.cache/pip/wheels}}. You can of course recreate
> the wheel with {{pip wheel}} or find the cached wheel in
> {{~/.cache/pip/wheels}}. Use {{pip -v install numpy}} to see where pip has
> placed the wheel in the cache.
> If you use Artifactory, you can upload your wheels to a local, central cache
> that can be shared across all your slaves. See [this
> documentation|https://www.jfrog.com/confluence/display/RTF/PyPI+Repositories#PyPIRepositories-LocalRepositories]
> to see how this works. This way, you can insert wheels into this local cache
> and they will be seen as if they had been uploaded to the official
> repository (the local cache and the remote cache can be "merged" into a
> virtual repository with Artifactory).
> *Using an internal PyPI mirror by default*
> Ask your IT to update the {{~/.pip/pip.conf}} of each node to point by
> default to the internal mirror:
> {code}
> [global]
> ; Low timeout
> timeout = 20
> index-url = https://<user>:<pass>@pypi.mycompany.org/
> {code}
> Now there is no more need to specify {{\-\-conf
> "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"}} in your
> spark-submit command line.
> Note: this will not work when installing a package with the {{python
> setup.py install}} syntax. In that case you need to update {{~/.pypirc}} and
> use the {{-r}} argument. This syntax is not used by spark-submit.