[
https://issues.apache.org/jira/browse/SPARK-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon resolved SPARK-9313.
---------------------------------
Resolution: Incomplete
> Enable a "docker run" invocation in place of PYSPARK_PYTHON
> -----------------------------------------------------------
>
> Key: SPARK-9313
> URL: https://issues.apache.org/jira/browse/SPARK-9313
> Project: Spark
> Issue Type: New Feature
> Components: PySpark
> Environment: Linux
> Reporter: thom neale
> Priority: Minor
> Labels: bulk-closed
> Original Estimate: 0.05h
> Remaining Estimate: 0.05h
>
> There's a potentially high-yield improvement that might be possible by
> enabling people to set PYSPARK_PYTHON (or possibly a new env var) to a
> "docker run" invocation of a specific docker image. I'm interested in taking
> a shot at this, but could use some pointers on overall pyspark architecture
> in order to avoid hurting myself or trying something stupid that won't work.
> History of this idea: I handle most of the spark infrastructure for
> MassMutual's data science team, and we currently push code updates out to
> spark workers with a combination of git post-receive hooks and ansible
> playbooks, all glued together with jenkins. It works well, but every time
> someone wants a specific PYSPARK_PYTHON environment with precise branch
> checkouts, for example, it has to be exquisitely configured in advance. What
> would be amazing is if we could run a docker image in place of
> PYSPARK_PYTHON, so people could build an image with whatever they want on it,
> push it to a docker registry, then as long as the spark worker nodes had a
> docker daemon running, they wouldn't need the images in advance--they would
> just pull the built images from the registry on the fly once someone
> submitted their job and specified the appropriate docker fu in place of
> PYSPARK_PYTHON. This would basically make the distribution of code to the
> workers self-service as long as users were savvy with docker. A lesser
> benefit is that the layered filesystem feature of docker would solve the
> minor issue (it's not really a problem) of a profusion of python virtualenvs,
> each loaded with a huge ML stack plus other deps, gobbling up gigs of space
> on the smaller code partitions on our workers. Each new combination of
> branch checkouts for our application code could use the same huge ML base
> image, and things would just be faster and simpler.
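> To make the idea concrete, here's a minimal sketch of the sort of wrapper
> script PYSPARK_PYTHON could point at on each worker. Everything in it is an
> assumption for illustration (the PYSPARK_DOCKER_IMAGE variable, the image
> name, the networking choice); nothing like this exists in Spark today.
> {code:python}
> #!/usr/bin/env python
> # Hypothetical wrapper: PYSPARK_PYTHON=/usr/local/bin/dockerized-python
> # Replaces itself with a "docker run" of the requested image, forwarding the
> # arguments Spark passes to its python executable (e.g. -m pyspark.daemon)
> # and keeping stdin/stdout attached so the JVM can still talk to the daemon.
> import os
> import sys
>
> image = os.environ.get("PYSPARK_DOCKER_IMAGE", "mycorp/pyspark-env:latest")
>
> os.execvp("docker", [
>     "docker", "run", "--rm", "-i",
>     "--net=host",   # one option; alternatively publish the daemon port with -p
>     image,
>     "python",
> ] + sys.argv[1:])
> {code}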
> What I Speculate This Would Require
> ---------------------------------------------------
> Based on a reading of pyspark/daemon.py, I think this would require:
> - somehow making the os.setpgid call inside manager() optional. The
> pyspark.daemon process isn't allowed to call setpgid, I think because it has
> pid 1 in the container. In my hacked branch I'm doing this by checking if a
> new environment variable is set (see the sketch after this list).
> - instead of binding to a random port, if the worker is dockerized, bind to a
> predetermined port
> - When the dockerized worker is invoked, query docker for the exposed port on
> the host, and print that instead
> - Possibly do the same with ports opened by forked workers?
> - Forward stdin/out to/from the container where appropriate
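> As a rough sketch of what the first three points could look like (this is
> not the real pyspark/daemon.py code; the PYSPARK_DOCKERIZED variable, the
> fixed port 9001, and the helper names are made up for illustration):
> {code:python}
> import os
> import socket
> import struct
> import subprocess
> import sys
>
>
> def manager():
>     # Point 1: only create a new process group when we are not PID 1 inside
>     # a container, where setpgid() is not permitted.
>     if not os.environ.get("PYSPARK_DOCKERIZED"):
>         os.setpgid(0, 0)
>
>     listen_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>     if os.environ.get("PYSPARK_DOCKERIZED"):
>         # Point 2: bind to a predetermined port so the container can expose
>         # it, instead of an ephemeral one.
>         listen_sock.bind(("0.0.0.0", 9001))
>     else:
>         listen_sock.bind(("127.0.0.1", 0))
>     listen_sock.listen(128)
>
>     # The JVM reads the daemon's port from stdout as a big-endian 32-bit int
>     # (this mirrors what daemon.py's write_int does today).
>     out = os.fdopen(sys.stdout.fileno(), "wb", 4)
>     out.write(struct.pack("!i", listen_sock.getsockname()[1]))
>     out.flush()
>
>     # ... accept loop, forking of workers, etc. unchanged ...
>
>
> def host_port_for(container_name, container_port=9001):
>     # Point 3 (run on the host, e.g. by the wrapper that launched the
>     # container): ask docker which host port the container port was published
>     # on, and report that instead. "docker port" prints e.g. "0.0.0.0:32768".
>     mapping = subprocess.check_output(
>         ["docker", "port", container_name, "%d/tcp" % container_port])
>     return int(mapping.decode().splitlines()[0].rsplit(":", 1)[1])
> {code}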
> My initial tinkering covers the first three points on 1.3.1, and I get an
> InvalidArgumentException with an out-of-range port number, probably
> indicating something is hitting an error and printing something else instead
> of the actual port.
> Any pointers people can supply would be most welcome; I'm really interested
> in at least succeeding in a demonstration of this hack, if not getting it
> merged any time soon.