[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362764#comment-15362764 ]
Jeff Zhang commented on SPARK-16367:
------------------------------------
[[email protected]] Thanks for the new idea; this makes the community more
powerful.
Here are a few of my concerns and comments.
bq. support of heterogeneous Spark nodes (e.g. 32 bits, 64 bits) is possible,
but one has to ship all wheel flavours and ensure pip is able to install in
every environment.
What about different OSes? Can wheels compiled on the client machine be used
on a different OS? This is my biggest concern with this approach.
bq. and disk space
For YARN this is not a problem, because the container is cleaned up after it
exits.
I am not sure whether the extra steps for creating a wheelhouse are too
complicated for users.
BTW, in the approach of SPARK-13587 I specified the cache dir when creating
the virtualenv. That means compilation is only needed the first time; after
that, each installation picks up the wheel files from the cache dir.
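For illustration, a minimal sketch of what such a cached install could look
like, using pip's standard --cache-dir option (the path below is only an
example, not the value used in SPARK-13587):
{code}
# first run: pip builds the wheels and stores them under the cache dir
pip install --cache-dir /tmp/pip-cache -r requirements.txt
# later runs in a fresh virtualenv: pip reuses the cached wheels,
# so no recompilation is needed
pip install --cache-dir /tmp/pip-cache -r requirements.txt
{code}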
> Wheelhouse Support for PySpark
> ------------------------------
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
> Issue Type: New Feature
> Components: Deploy, PySpark
> Affects Versions: 1.6.1, 1.6.2, 2.0.0
> Reporter: Semet
> Labels: newbie, python, python-wheel, wheelhouse
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> *Rationale*
> In order to deploy packages written in Scala, it is recommended to build big
> fat jar files. This puts all the dependencies in one package, so the only
> "cost" is the copy time to deploy this file on every Spark node.
> On the other hand, Python deployment is more difficult once you want to use
> external packages, and you don't really want to go through IT to deploy the
> packages into the virtualenv of each node.
> *Previous approaches*
> I based the current proposal on the two following issues related to this
> point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge the two, in order to support both
> wheel installation and virtualenv creation.
> *Uber Fat Wheelhouse for Python Deployment*
> In Python, the packaging standard is now "wheels", which go further than the
> good old ".egg" files. With a wheel file (".whl"), the package is already
> prepared for a given architecture. You can have several wheels, each specific
> to an architecture or environment.
> The {{pip}} tool knows how to select the package matching the current system
> and how to install it at light speed. Said otherwise, a package that requires
> compilation of a C module, for instance, does *not* compile anything when
> installed from a wheel file.
> {{pip}} also provides the ability to easily generate all the wheels of all
> the packages used by a given module (inside a "virtualenv"). This is called a
> "wheelhouse". You can even skip the compilation entirely and retrieve the
> wheels directly from pypi.python.org.
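> For illustration, the target platform is encoded in the wheel file name, so a
> wheelhouse can hold several flavours of the same package side by side (the
> exact file names below are only examples):
> {code}
> numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl    # Linux 64 bits, CPython 2.7
> numpy-1.11.1-cp27-cp27m-win32.whl                 # Windows 32 bits, CPython 2.7
> {code}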
> *Developer workflow*
> Here is, more concretely, my proposal from the PySpark developer's point of
> view:
> - you are writing a PySpark script that grows in size and dependencies.
> Deploying it on Spark requires, for example, building numpy or Theano and
> other dependencies
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your
> script into a standard Python package:
> -- write a {{requirements.txt}}. I recommend pinning all package versions.
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the
> {{requirements.txt}} (a usage sketch follows the example file):
> {code}
> astroid==1.4.6 # via pylint
> autopep8==1.2.4
> click==6.6 # via pip-tools
> colorama==0.3.7 # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0 # via spark-testing-base
> first==2.0.1 # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2 # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0 # via autopep8
> pip-tools==1.6.5
> py==1.4.31 # via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0 # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8 # via astroid
> {code}
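> A minimal sketch of how pip-tools can be used to produce such a pinned file
> (the {{requirements.in}} name is the pip-tools convention, not something this
> proposal requires):
> {code}
> pip install pip-tools
> # list only your direct dependencies in requirements.in, then let
> # pip-compile pin the full transitive set into requirements.txt
> pip-compile requirements.in
> {code}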
> -- write a {{setup.py}} with some entry points or packages. Use
> [PBR|http://docs.openstack.org/developer/pbr/], it makes the job of
> maintaining a {{setup.py}} file really easy (see the sketch below)
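> A minimal sketch of what the PBR-based packaging files could look like (the
> package name and metadata below are placeholders, not part of this proposal):
> {code}
> # setup.py: with PBR, the real metadata lives in setup.cfg
> cat > setup.py <<'EOF'
> from setuptools import setup
> setup(setup_requires=['pbr'], pbr=True)
> EOF
>
> # setup.cfg: minimal metadata for a hypothetical package "my_package"
> cat > setup.cfg <<'EOF'
> [metadata]
> name = my_package
> summary = Example PySpark job packaged for the wheelhouse
>
> [files]
> packages =
>     my_package
> EOF
> {code}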
> -- create a virtualenv if not already in one:
> {code}
> virtualenv env
> {code}
> -- work in your environment, define the requirements you need in
> {{requirements.txt}}, and do all the {{pip install}} you need
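> For example (a sketch, assuming the {{env}} virtualenv created above):
> {code}
> . env/bin/activate
> pip install -r requirements.txt
> {code}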
> - create the wheelhouse for your current project
> {code}
> pip install wheel
> pip wheel . --wheel-dir wheelhouse
> {code}
> This can take some time, but in the end you have all the {{.whl}} files
> required *for your current system*
> - zip it into a {{wheelhouse.zip}} (see the sketch below).
> Note that your own package (for instance 'my_package') is also generated as a
> wheel and so is installed by {{pip}} automatically.
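> A sketch of the packaging step, assuming the {{wheelhouse}} directory built
> above and the standard {{zip}} tool:
> {code}
> # bundle every wheel (and nothing else) into wheelhouse.zip
> cd wheelhouse
> zip ../wheelhouse.zip *.whl
> cd ..
> {code}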
> Now comes the time to submit the project:
> {code}
> bin/spark-submit --master master --deploy-mode client --files
> /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip
> --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
> {code}
> You can see that:
> - no extra argument is added to the command line. All configuration goes
> through the {{--conf}} argument (this has been taken directly from
> SPARK-13587). According to the history of the Spark source code, I guess the
> goal is to simplify the maintenance of the various command line interfaces by
> avoiding too many specific arguments.
> - The wheelhouse deployment is triggered by the {{ --conf
> "spark.pyspark.virtualenv.enabled=true" }} argument. The {{requirements.txt}}
> and {{wheelhouse.zip}} are copied through {{--files}}. The names of both
> files can be changed through {{--conf}} arguments. I guess with proper
> documentation this will not be a problem
> - you still need to define the paths to {{requirements.txt}} and
> {{wheelhouse.zip}} (they will be automatically copied to each node). This is
> important since it allows {{pip install}}, running on each node, to pick only
> the wheels it needs. For example, if you have a package compiled for 32 bits
> and for 64 bits, you will have two wheels, and on each node {{pip}} will only
> select the right one
> - I have chosen to keep the script at the end of the command line, but for me
> it is just a launcher script; it can be just a few lines:
> {code}
> #!/usr/bin/env python
> from mypackage import run
> run()
> {code}
> - on each node, a new virtualenv is created *at each deployment*. This has a
> cost, but not much of one, since {{pip install}} will only install wheels; no
> compilation nor Internet connection is required. The command line for
> installing the wheels on each node will look like:
> {code}
> pip install --no-index --find-links=/path/to/node/wheelhouse -r
> requirements.txt
> {code}
> *Advantages*
> - quick installation, since there is no compilation
> - no Internet connectivity is required, so there is no need to mess with the
> corporate proxy or to maintain a local mirror of PyPI
> - package versioning isolation (two Spark jobs can depend on two different
> versions of a given library)
> *Disadvantages*
> - creating a virtualenv at each execution takes time; not that much, but
> still it can take a few seconds
> - and disk space
> - slightly more complex to set up than sending a simple Python script, but
> that capability is not lost
> - support of heterogeneous Spark nodes (e.g. 32 bits, 64 bits) is possible,
> but one has to ship all wheel flavours and ensure pip is able to install in
> every environment. The complexity of this task is in the hands of the
> developer and no longer of the IT staff! (IMHO, this is an advantage)
> *Code submission*
> I have already started working on this point, beginning by merging the two
> pull requests [#5408|https://github.com/apache/spark/pull/5408] and
> [#13599|https://github.com/apache/spark/pull/13599].
> I'll upload a patch for review as soon as possible.
> I see two major open questions:
> - I don't know YARN or Mesos that well, so I might need some help for the
> final integration
> - documentation should really be carefully crafted so users are not lost in
> all these concepts
> I really think having this "wheelhouse" support for Spark will help with
> using, maintaining, and evolving Python scripts on Spark. Python has a rich
> set of mature libraries, and Spark should do anything it can to help
> developers easily access and use them in their everyday job.