[ https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362764#comment-15362764 ]
Jeff Zhang commented on SPARK-16367:
------------------------------------
[[email protected]] Thanks for the new idea; this makes the community more
powerful.
Here are a few of my concerns and comments.
bq. support of heterogeneous Spark nodes (e.g. 32 bits, 64 bits) is possible,
but one has to ship all wheel flavours and ensure pip is able to install in
every environment.
What about different OSes? Can wheels compiled on the client machine be used
on a different OS? This is my biggest concern with this approach.
bq. and disk space
For YARN this is not a problem, because the container is cleaned up after it
exits.
I am not sure whether the extra steps for creating a wheelhouse are too
complicated for users.
BTW, in the approach of SPARK-13587 I specified the cache dir when creating
the virtualenv. That means compilation is only needed the first time; after
that, each installation picks up the wheel files from the cache dir.
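For illustration, a minimal sketch of what such a cached install could look
like, using pip's standard --cache-dir option (the path below is only an
example, not the value used in SPARK-13587):
{code}
# first run: pip builds the wheels and stores them under the cache dir
pip install --cache-dir /tmp/pip-cache -r requirements.txt
# later runs in a fresh virtualenv: pip reuses the cached wheels,
# so no recompilation is needed
pip install --cache-dir /tmp/pip-cache -r requirements.txt
{code}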
> Wheelhouse Support for PySpark
> ------------------------------
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
> Issue Type: New Feature
> Components: Deploy, PySpark
> Affects Versions: 1.6.1, 1.6.2, 2.0.0
> Reporter: Semet
> Labels: newbie, python, python-wheel, wheelhouse
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> *Rationale*
> In order to deploy packages written in Scala, it is recommended to build big
> fat jar files. This puts all the dependencies in one package, so the only
> "cost" is the copy time to deploy this file on every Spark node.
> On the other hand, Python deployment is more difficult once you want to use
> external packages, and you don't really want to go through IT to deploy the
> packages into the virtualenv of each node.
> *Previous approaches*
> I based the current proposal on the two following issues related to this
> point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge the two, in order to support both
> wheel installation and virtualenv creation.
> *Uber Fat Wheelhouse for Python Deployment*
> In Python, the packaging standard is now "wheels", which go further than the
> good old ".egg" files. With a wheel file (".whl"), the package is already
> prepared for a given architecture. You can have several wheels, each specific
> to an architecture or environment.
> The {{pip}} tool knows how to select the package matching the current system
> and how to install it at light speed. Said otherwise, a package that requires
> compilation of a C module, for instance, does *not* compile anything when
> installed from a wheel file.
> {{pip}} also provides the ability to easily generate all the wheels of all
> the packages used by a given module (inside a "virtualenv"). This is called a
> "wheelhouse". You can even skip the compilation entirely and retrieve the
> wheels directly from pypi.python.org.
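> For illustration, the target platform is encoded in the wheel file name, so a
> wheelhouse can hold several flavours of the same package side by side (the
> exact file names below are only examples):
> {code}
> numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl    # Linux 64 bits, CPython 2.7
> numpy-1.11.1-cp27-cp27m-win32.whl                 # Windows 32 bits, CPython 2.7
> {code}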
> *Developer workflow*
> Here is, more concretely, my proposal from the PySpark developer's point of
> view:
> - you are writing a PySpark script that grows in size and dependencies.
> Deploying it on Spark requires, for example, building numpy or Theano and
> other dependencies
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your
> script into a standard Python package:
> -- write a {{requirements.txt}}. I recommend pinning all package versions.
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the
> {{requirements.txt}} (a usage sketch follows the example file):
> {code}
> astroid==1.4.6 # via pylint
> autopep8==1.2.4
> click==6.6 # via pip-tools
> colorama==0.3.7 # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0 # via spark-testing-base
> first==2.0.1 # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2 # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0 # via autopep8
> pip-tools==1.6.5
> py==1.4.31 # via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0 # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8 # via astroid
> {code}
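> A minimal sketch of how pip-tools can be used to produce such a pinned file
> (the {{requirements.in}} name is the pip-tools convention, not something this
> proposal requires):
> {code}
> pip install pip-tools
> # list only your direct dependencies in requirements.in, then let
> # pip-compile pin the full transitive set into requirements.txt
> pip-compile requirements.in
> {code}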
> -- write a {{setup.py}} with some entry points or packages. Use
> [PBR|http://docs.openstack.org/developer/pbr/], it makes the job of
> maintaining a {{setup.py}} file really easy (see the sketch below)
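> A minimal sketch of what the PBR-based packaging files could look like (the
> package name and metadata below are placeholders, not part of this proposal):
> {code}
> # setup.py: with PBR, the real metadata lives in setup.cfg
> cat > setup.py <<'EOF'
> from setuptools import setup
> setup(setup_requires=['pbr'], pbr=True)
> EOF
>
> # setup.cfg: minimal metadata for a hypothetical package "my_package"
> cat > setup.cfg <<'EOF'
> [metadata]
> name = my_package
> summary = Example PySpark job packaged for the wheelhouse
>
> [files]
> packages =
>     my_package
> EOF
> {code}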
> -- create a virtualenv if not already in one:
> {code}
> virtualenv env
> {code}
> -- work in your environment, define the requirements you need in
> {{requirements.txt}}, and do all the {{pip install}} you need
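> For example (a sketch, assuming the {{env}} virtualenv created above):
> {code}
> . env/bin/activate
> pip install -r requirements.txt
> {code}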
> - create the wheelhouse for your current project
> {code}
> pip install wheel
> pip wheel . --wheel-dir wheelhouse
> {code}
> This can take some time, but in the end you have all the {{.whl}} files
> required *for your current system*
> - zip it into a {{wheelhouse.zip}} (see the sketch below).
> Note that your own package (for instance 'my_package') is also generated as a
> wheel and so is installed by {{pip}} automatically.
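> A sketch of the packaging step, assuming the {{wheelhouse}} directory built
> above and the standard {{zip}} tool:
> {code}
> # bundle every wheel (and nothing else) into wheelhouse.zip
> cd wheelhouse
> zip ../wheelhouse.zip *.whl
> cd ..
> {code}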
> Now comes the time to submit the project:
> {code}
> bin/spark-submit --master master --deploy-mode client --files
> /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip
> --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
> {code}
> You can see that:
> - no extra argument is added to the command line. All configuration goes
> through the {{--conf}} argument (this has been taken directly from
> SPARK-13587). According to the history of the Spark source code, I guess the
> goal is to simplify the maintenance of the various command line interfaces by
> avoiding too many specific arguments.
> - The wheelhouse deployment is triggered by the {{ --conf
> "spark.pyspark.virtualenv.enabled=true" }} argument. The {{requirements.txt}}
> and {{wheelhouse.zip}} are copied through {{--files}}. The names of both
> files can be changed through {{--conf}} arguments. I guess with proper
> documentation this will not be a problem
> - you still need to define the paths to {{requirements.txt}} and
> {{wheelhouse.zip}} (they will be automatically copied to each node). This is
> important since it allows {{pip install}}, running on each node, to pick only
> the wheels it needs. For example, if you have a package compiled for 32 bits
> and for 64 bits, you will have two wheels, and on each node {{pip}} will only
> select the right one
> - I have chosen to keep the script at the end of the command line, but for me
> it is just a launcher script; it can be just a few lines:
> {code}
> #!/usr/bin/env python
> from mypackage import run
> run()
> {code}
> - on each node, a new virtualenv is created *at each deployment*. This has a
> cost, but not much of one, since {{pip install}} will only install wheels; no
> compilation nor Internet connection is required. The command line for
> installing the wheels on each node will look like:
> {code}
> pip install --no-index --find-links=/path/to/node/wheelhouse -r
> requirements.txt
> {code}
> *Advantages*
> - quick installation, since there is no compilation
> - no Internet connectivity is required, so there is no need to mess with the
> corporate proxy or to maintain a local mirror of PyPI
> - package versioning isolation (two Spark jobs can depend on two different
> versions of a given library)
> *Disadvantages*
> - creating a virtualenv at each execution takes time; not that much, but
> still it can take a few seconds
> - and disk space
> - slightly more complex to set up than sending a simple Python script, but
> that capability is not lost
> - support of heterogeneous Spark nodes (e.g. 32 bits, 64 bits) is possible,
> but one has to ship all wheel flavours and ensure pip is able to install in
> every environment. The complexity of this task is in the hands of the
> developer and no longer of the IT staff! (IMHO, this is an advantage)
> *Code submission*
> I have already started working on this point, beginning by merging the two
> pull requests [#5408|https://github.com/apache/spark/pull/5408] and
> [#13599|https://github.com/apache/spark/pull/13599].
> I'll upload a patch for review as soon as possible.
> I see two major open questions:
> - I don't know YARN or Mesos that well, so I might need some help for the
> final integration
> - documentation should really be carefully crafted so users are not lost in
> all these concepts
> I really think having this "wheelhouse" support for Spark will help with
> using, maintaining, and evolving Python scripts on Spark. Python has a rich
> set of mature libraries, and Spark should do anything it can to help
> developers easily access and use them in their everyday job.