GitHub user sryza opened a pull request: https://github.com/apache/incubator-spark/pull/640
SPARK-1004: PySpark on YARN

Make pyspark work in yarn-client mode. This builds on Josh's work. I tested it and verified that it works on a 5-node cluster.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sryza/incubator-spark sandy-spark-1004

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-spark/pull/640.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #640

----
commit e752a6a1c8a9d7cbc31d7b911800e22db6fcb2b0
Author: Josh Rosen <joshro...@apache.org>
Date:   2014-01-24T18:19:58Z

    Automatically set Yarn env vars in PySpark (SPARK-1030).

commit 0adcaa971086853b254baf32748811561bb6e209
Author: Josh Rosen <joshro...@apache.org>
Date:   2014-01-25T23:28:56Z

    WIP towards PySpark on YARN:

    - Remove reliance on SPARK_HOME on the workers. Only the driver
      should know about SPARK_HOME. On the workers, we ensure that the
      PySpark Python libraries are added to the PYTHONPATH.

    - Add a Makefile for generating a "fat zip" that contains PySpark's
      Python dependencies. This is a bit of a hack and I'd be open to
      better packaging tools, but this doesn't require any extra Python
      libraries. This use case doesn't seem to be well-addressed by the
      existing Python packaging tools: there are plenty of tools to
      package complete Python environments (such as pyinstaller and
      virtualenv) or to bundle *individual* libraries (e.g. distutils),
      but few to generate portable fat zips or eggs.

    This hasn't been tested with YARN and may not actually compile.

commit d4a71d0495d072d5b5364601e7cd0dc9a7c9c9b9
Author: Josh Rosen <joshro...@apache.org>
Date:   2014-02-19T06:27:21Z

    Add missing setup.py file for PySpark.

commit dcda63863a41414ba5e410092dc4fbab2e353543
Author: Sandy Ryza <sa...@cloudera.com>
Date:   2014-02-24T07:06:42Z

    Improvements

commit 38546d4f282727f3ae112f1e564df72443b726f5
Author: Sandy Ryza <sa...@cloudera.com>
Date:   2014-02-24T07:26:01Z

    Don't set SPARK_JAR

----
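
For readers unfamiliar with the "fat zip" idea mentioned in commit 0adcaa97, the following is a rough sketch of the general technique, not the code from this pull request. The PR uses a Makefile; this sketch swaps in Python's standard zipfile module instead. The names build_fat_zip, toy_pkg, and deps.zip are hypothetical and chosen only for illustration. It assumes pure-Python packages, which Python can import directly from a zip archive once the archive is on sys.path (or PYTHONPATH).

```python
import os
import sys
import tempfile
import zipfile


def build_fat_zip(package_dirs, zip_path):
    """Bundle pure-Python package directories into one importable zip.

    Hypothetical helper for illustration; the PR itself uses a Makefile.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for package_dir in package_dirs:
            package_dir = os.path.abspath(package_dir)
            parent = os.path.dirname(package_dir)
            for root, _dirs, files in os.walk(package_dir):
                for name in files:
                    if name.endswith(".py"):
                        full = os.path.join(root, name)
                        # Store paths relative to the package's parent so
                        # that "import <package>" resolves from the zip root.
                        zf.write(full, os.path.relpath(full, parent))
    return zip_path


def demo():
    """Create a toy package, zip it, and import it from the zip."""
    work = tempfile.mkdtemp()
    pkg = os.path.join(work, "toy_pkg")  # hypothetical package name
    os.makedirs(pkg)
    with open(os.path.join(pkg, "__init__.py"), "w") as f:
        f.write("VALUE = 42\n")

    zip_path = build_fat_zip([pkg], os.path.join(work, "deps.zip"))

    # Equivalent in spirit to adding the zip to PYTHONPATH on a worker.
    sys.path.insert(0, zip_path)
    import toy_pkg

    return toy_pkg.VALUE, toy_pkg.__file__


if __name__ == "__main__":
    value, origin = demo()
    print(value, origin)
```

On a worker, the same effect would typically come from putting the zip's path on PYTHONPATH before the Python process starts, so that no SPARK_HOME lookup is needed there. Packages with compiled extension modules generally cannot be imported from a zip this way.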