[ https://issues.apache.org/jira/browse/SPARK-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-9313.
---------------------------------
    Resolution: Incomplete

> Enable a "docker run" invocation in place of PYSPARK_PYTHON
> -----------------------------------------------------------
>
>                 Key: SPARK-9313
>                 URL: https://issues.apache.org/jira/browse/SPARK-9313
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>         Environment: Linux
>            Reporter: thom neale
>            Priority: Minor
>              Labels: bulk-closed
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> There's a potentially high-yield improvement that might be possible by 
> enabling people to set PYSPARK_PYTHON (or possibly a new env var) to a 
> docker run of a specific docker image. I'm interested in taking a shot at 
> this, but could use some pointers on overall pyspark architecture in order 
> to avoid hurting myself or trying something stupid that won't work. 
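> For concreteness, here is roughly what I'm imagining (the wrapper path and 
> image name are made up). As far as I can tell the worker execs whatever 
> PYSPARK_PYTHON names as a single executable, so the easiest way to sneak a 
> whole docker command in is a tiny wrapper that passes the interpreter 
> arguments (e.g. -m pyspark.daemon) through to a container: 
>
>     #!/usr/bin/env python
>     # Hypothetical wrapper installed on each worker, e.g. at
>     # /usr/local/bin/docker-python, with
>     #   PYSPARK_PYTHON=/usr/local/bin/docker-python
>     # -i keeps stdin open so the daemon can talk over stdin/stdout;
>     # the image gets pulled from the registry on first use.
>     import os
>     import sys
>
>     os.execvp("docker", ["docker", "run", "-i", "--rm",
>                          "my-registry/ml-stack:latest", "python"]
>                         + sys.argv[1:])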
> History of this idea: I handle most of the Spark infrastructure for 
> MassMutual's data science team, and we currently push code updates out to 
> Spark workers with a combination of git post-receive hooks and Ansible 
> playbooks, all glued together with Jenkins. It works well, but every time 
> someone wants a specific PYSPARK_PYTHON environment with precise branch 
> checkouts, for example, it has to be exquisitely configured in advance. What 
> would be amazing is if we could run a docker image in place of 
> PYSPARK_PYTHON, so people could build an image with whatever they want on 
> it and push it to a docker registry; then, as long as the Spark worker nodes 
> had a docker daemon running, they wouldn't need the images in advance--they 
> would just pull the built images from the registry on the fly once someone 
> submitted a job and specified the appropriate docker fu in place of 
> PYSPARK_PYTHON. This would basically make the distribution of code to the 
> workers self-service, as long as users were savvy with docker. A lesser 
> benefit is that docker's layered filesystem would solve the (admittedly 
> minor) problem of a profusion of Python virtualenvs, each loaded with a huge 
> ML stack plus other deps, gobbling up gigs of space on the smaller code 
> partitions on our workers. Each new combination of branch checkouts for our 
> application code could share the same huge ML base image, and things would 
> just be faster and simpler. 
> What I Speculate This Would Require 
> --------------------------------------------------- 
> Based on a reading of pyspark/daemon.py, I think this would require the 
> following (a rough sketch of the first three points follows this list): 
> - somehow making the os.setpgid call inside manager() optional. The 
> pyspark.daemon process isn't allowed to call setpgid, I think because it 
> has pid 1 in the container. In my hacked branch I'm doing this by checking 
> whether a new environment variable is set. 
> - instead of binding to a random port, if the worker is dockerized, binding 
> to a predetermined port 
> - when the dockerized worker is invoked, querying docker for the exposed 
> port on the host, and printing that instead. Possibly doing the same with 
> ports opened by forked workers? 
> - forwarding stdin/out to/from the container where appropriate 
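> Roughly, for the first three points (the PYSPARK_DOCKERIZED variable and 
> port 7999 are inventions of my hacked branch, not anything in Spark today): 
>
>     import os
>     import socket
>
>     def manager():
>         # pyspark.daemon is pid 1 inside the container and setpgid
>         # fails there, so skip it when the new env var is set
>         if not os.environ.get("PYSPARK_DOCKERIZED"):
>             os.setpgid(0, 0)
>
>         listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>         if os.environ.get("PYSPARK_DOCKERIZED"):
>             # bind a fixed, EXPOSEd port on all interfaces so docker
>             # can map it to the host; 7999 is an arbitrary choice
>             listen_sock.bind(("0.0.0.0", 7999))
>         else:
>             # stock behavior: random loopback port
>             listen_sock.bind(("127.0.0.1", 0))
>         listen_sock.listen(1024)
>
> and on the invoking side, something like: 
>
>     import subprocess
>
>     # ask docker which host port got mapped to the daemon's fixed
>     # container port; "pyspark-worker" is a made-up container name
>     out = subprocess.check_output(
>         ["docker", "port", "pyspark-worker", "7999"])
>     host_port = int(out.decode().strip().rsplit(":", 1)[1])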
> My initial tinkering has done the first three points on 1.3.1, and I get an 
> InvalidArgumentException with an out-of-range port number, probably 
> indicating that something is hitting an error and printing something else 
> instead of the actual port. 
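> If I understand the handshake right, daemon.py announces its listen port by 
> writing a packed integer to stdout, which the JVM parent reads back, so any 
> stray bytes that land on stdout first (docker pull progress, a warning from 
> the wrapper) would get read as the port. Something like this is the write I 
> mean (a sketch of the idea, not the actual daemon.py code): 
>
>     import os
>     import struct
>     import sys
>
>     listen_port = 7999  # whatever the daemon actually bound
>
>     # reopen stdout in binary mode and announce the port as a 4-byte
>     # big-endian int; anything printed before this corrupts the read
>     out = os.fdopen(os.dup(sys.stdout.fileno()), "wb")
>     out.write(struct.pack("!i", listen_port))
>     out.flush()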
> Any pointers people can supply would be most welcome; I'm really interested 
> in at least succeeding in a demonstration of this hack, if not getting it 
> merged any time soon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
