[
https://issues.apache.org/jira/browse/SPARK-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085990#comment-15085990
]
thom neale commented on SPARK-9313:
-----------------------------------
[~joshrosen] There's only one reason I know of so far--
In daemon.py the manager function currently calls os.setpgid(0, 0), which will
fail inside a linux container, where the main foregrounded process will have a
process ID of 1 [1]. Usually the init system has pid 1, but a foregrounded
process in a linux container will be the first process running in the
container. The rules/semantics of changing process group ID in *nixes, which I
don't understand well, won't allow a change to the pgid under those
circumstances. To [~sowen]'s point, this issue would probably arise with
any other containerization tool too, so a docker-specific fix wouldn't make
sense. It'd probably be good enough to put a try-except block around the
.setpgid call and pass if it hits the expected error in this situation. When
the container gets stopped, all child processes will be killed too, so there
isn't a need to change the pgid as a means of reaping child processes in the
container scenario.
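A minimal sketch of that try-except guard (the function name is illustrative, not the actual daemon.py code):

```python
import errno
import os

def try_setpgid():
    # Attempt to become a process group leader, as daemon.py's manager()
    # does today. When this process is PID 1 in a container it is already
    # a session leader, so setpgid fails with EPERM; that specific error
    # is harmless here and can be swallowed.
    try:
        os.setpgid(0, 0)
        return True
    except OSError as e:
        if e.errno != errno.EPERM:
            raise
        return False
```

Whether it returns True or False, the worker can proceed, since in the container case child reaping is handled by container teardown anyway.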
[1] Traceback when calling setpgid(0, 0) in Python 3.5:

$ docker run -it --rm --entrypoint=/bin/bash python -c "python -c 'import os; os.setpgid(0, 0)'"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
PermissionError: [Errno 1] Operation not permitted

And in 2.7:

$ docker run -it --rm --entrypoint=/bin/bash python:2.7 -c "python -c 'import os; os.setpgid(0, 0)'"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
OSError: [Errno 1] Operation not permitted
> Enable a "docker run" invocation in place of PYSPARK_PYTHON
> -----------------------------------------------------------
>
> Key: SPARK-9313
> URL: https://issues.apache.org/jira/browse/SPARK-9313
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Environment: Linux
> Reporter: thom neale
> Priority: Minor
> Original Estimate: 0.05h
> Remaining Estimate: 0.05h
>
> There's a potentially high-yield improvement that might be possible by
> enabling people to set PYSPARK_PYTHON (or possibly a new env var) to a docker
> run of a specific docker image. I'm interested in taking a shot at this, but
> could use some pointers on overall pyspark architecture in order to avoid
> hurting myself or trying something stupid that won't work.
> History of this idea: I handle most of the spark infrastructure for
> MassMutual's data science team, and we currently push code updates out to
> spark workers with a combination of git post-receive hooks and ansible
> playbooks, all glued together with jenkins. It works well, but every time
> someone wants a specific PYSPARK_PYTHON environment with precise branch
> checkouts, for example, it has to be exquisitely configured in advance. What
> would be amazing is if we could run a docker image in place of
> PYSPARK_PYTHON, so people could build an image with whatever they want on it,
> push it to a docker registry, then as long as the spark worker nodes had a
> docker daemon running, they wouldn't need the images in advance--they would
> just pull the built images from the registry on the fly once someone
> submitted their job and specified the appropriate docker fu in place of
> PYSPARK_PYTHON. This would basically make the distribution of code to the
> workers self-service as long as users were savvy with docker. A lesser
> benefit is that the layered filesystem feature of docker would solve the
> minor issue (it's not really a problem) of a profusion of python virtualenvs,
> each loaded with a huge ML stack plus other deps, gobbling up gigs of
> space on smaller code partitions on our workers. Each new combination of
> branch checkouts for our application code could use the same huge ML base
> image, and things would just be faster and simpler.
> What I Speculate This Would Require
> ---------------------------------------------------
> Based on a reading of pyspark/daemon.py, I think this would require:
> - somehow making the os.setpgid call inside manager() optional. The
> pyspark.daemon process isn't allowed to call setpgid, I think because it has
> pid 1 in the container. In my hacked branch I'm doing this by checking if a
> new environment variable is set.
> - instead of binding to a random port, if the worker is dockerized, bind to a
> predetermined port
> - When the dockerized worker is invoked, query docker for the exposed port on
> the host, and print that instead
> - Possibly do the same with ports opened by forked workers?
> - Forward stdin/out to/from the container where appropriate
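For looking up the exposed port, one hedged sketch is to shell out to the `docker port` subcommand and parse its "host:port" output (the helper names and the fixed container port are assumptions for illustration, not anything in pyspark today):

```python
import subprocess

def parse_host_port(output):
    # "docker port <container> <port>" prints something like
    # "0.0.0.0:49153"; keep only the host-side port number.
    return int(output.strip().rsplit(":", 1)[1])

def host_port_for(container, container_port=5555):
    # Ask the docker daemon which host port is mapped to the worker's
    # predetermined container port, so that port can be printed in place
    # of the randomly bound one.
    out = subprocess.check_output(
        ["docker", "port", container, str(container_port)])
    return parse_host_port(out.decode())
```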
> My initial tinkering has done the first three points on 1.3.1, and I get an
> InvalidArgumentException with an out-of-range port number, probably
> indicating that something is hitting an error and printing something else
> instead of the actual port.
> Any pointers people can supply would be most welcome; I'm really interested
> in at least succeeding in a demonstration of this hack, if not getting it
> merged any time soon.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]