Semet created SPARK-16367:
-----------------------------
Summary: Wheelhouse Support for PySpark
Key: SPARK-16367
URL: https://issues.apache.org/jira/browse/SPARK-16367
Project: Spark
Issue Type: Improvement
Components: Deploy, PySpark
Affects Versions: 1.6.2, 1.6.1, 2.0.0
Reporter: Semet
*Rationale*
To deploy packages written in Scala, the recommended practice is to build a big fat jar file. This puts all dependencies into one package, so the only "cost" is the copy time to deploy this file on every Spark node.
Python deployment, on the other hand, is more difficult once you want to use external packages, and you don't really want to ask the IT team to deploy the packages into the virtualenv of each node.
*Previous approaches*
I based the current proposal on the two following issues related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587 ("Support virtualenv in PySpark")
So here is my proposal:
*Uber Fat Wheelhouse for Python Deployment*
In Python, the packaging standard is now "wheels", which go further than the good old ".egg" files. With a wheel file (".whl"), the package is already built for a given architecture. You can have several wheels, each specific to an architecture or environment.
The {{pip}} tool knows how to select the package matching the current system and how to install it at light speed. In other words, a package that requires compilation of a C module, for instance, does *not* compile anything when installed from a wheel file.
{{pip}} also provides the ability to easily generate the wheels of all packages used by a given project (inside a "virtualenv"). This set of wheels is called a "wheelhouse". You can even skip the compilation step entirely and retrieve the wheels directly from pypi.python.org.
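As an illustration (the filenames below are hypothetical examples following the standard wheel naming scheme), each wheel's name encodes the Python version, ABI and platform it targets, which is how {{pip}} picks the right file on a given node:
{code}
wheelhouse/
  numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl   <- CPython 2.7, 64-bit Linux
  numpy-1.11.1-cp27-cp27mu-manylinux1_i686.whl     <- CPython 2.7, 32-bit Linux
{code}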
*Developer workflow*
Here is, more concretely, how my proposal works for developers:
- you are writing a PySpark script that grows in size and dependencies. Deploying it on Spark, for example, requires building numpy or Theano and other dependencies
- to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your script into a standard Python package:
-- write a {{requirements.txt}}
-- write a {{setup.py}}. Use [PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining a {{setup.py}} file really easy
-- use [pip-tools|https://github.com/nvie/pip-tools] to maintain the {{requirements.txt}}
-- create a virtualenv if you do not have one already:
{code}
virtualenv env
{code}
-- work in your environment, define the requirements you need in
{{requirements.txt}}, and do all the {{pip install}} runs you need.
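As a sketch, a minimal {{requirements.txt}} for the dependencies mentioned above might look like this (the pinned versions are illustrative):
{code}
numpy==1.11.1
Theano==0.8.2
{code}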
- create the wheelhouse for your current project
{code}
pip install wheel
pip wheel . --wheel-dir wheelhouse
{code}
- zip it into a {{wheelhouse.zip}}.
Note that your own package (for instance {{my_package}}) can also be generated as a wheel and thus installed by {{pip}} automatically.
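The zip step above can be sketched as follows (assuming the {{wheelhouse}} directory produced by the {{pip wheel}} command):
{code}
# bundle the generated wheels into a single archive for submission
mkdir -p wheelhouse              # no-op if "pip wheel" already created it
zip -r wheelhouse.zip wheelhouse
{code}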
Now comes the time to submit the project:
{code}
bin/spark-submit --master master --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=native" \
  --conf "spark.pyspark.virtualenv.requirements=/path/to/virtualenv/requirements.txt" \
  --conf "spark.pyspark.virtualenv.bin.path=virtualenv" \
  --conf "spark.pyspark.virtualenv.wheelhouse=/path/to/virtualenv/wheelhouse.zip" \
  ~/path/to/launcher_script.py
{code}
You can see that:
- no extra argument is added to the command line; all configuration goes through the {{--conf}} arguments (this has been taken directly from SPARK-13587). Judging from the history of the Spark source code, I guess the goal is to simplify the maintenance of the various command-line interfaces by avoiding too many specific arguments
- the command line is indeed pretty complex, but I guess with proper documentation this should not be a problem
- you still need to define the paths to {{requirements.txt}} and {{wheelhouse.zip}} (they will be automatically copied to each node). This is important since it allows the {{pip install}} running on each node to pick only the wheels it needs. For example, if you have a package compiled for both 32 bits and 64 bits, you will have 2 wheels, and on each node {{pip}} will select only the right one
- I have chosen to keep the script at the end of the command line, but for me it is just a launcher script; it can be as short as a few lines:
{code}
#!/usr/bin/env python
from mypackage import run
run()
{code}
*Advantages*
- quick installation, since there is no compilation
- no Internet connectivity required: no need to mess with the corporate proxy or to maintain a local mirror of pypi
- package version isolation (two Spark jobs can depend on two different versions of a given library)
*Disadvantages*
- slightly more complex to set up than sending a simple Python script, but that capability is not lost
- support of heterogeneous Spark nodes (e.g. 32 bits, 64 bits) is possible, but one has to ship all wheel flavours and ensure {{pip}} is able to install in every environment. The complexity of this task is in the hands of the developer and no longer of the IT persons! (IMHO, this is an advantage)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]