GitHub user vanzin opened a pull request:
https://github.com/apache/spark/pull/6360
[SPARK-5479] [yarn] Handle --py-files correctly in YARN.
The bug description is a little misleading: the actual issue is that
.py files are not handled correctly when distributed by YARN. They're
added to "spark.submit.pyFiles", which, when processed by context.py,
explicitly whitelists certain extensions (see PACKAGE_EXTENSIONS),
and that does not include .py files.
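As a rough illustration of that filtering (a sketch only; the real check
lives in Python's context.py, and the extension list below is assumed,
not quoted):

    // Sketch of the allow-list check described above: anything whose
    // extension is not on the list never reaches the Python path, so plain
    // .py files shipped through "spark.submit.pyFiles" are silently dropped.
    object PyFilesFilter {
      // Assumed contents, mirroring the idea of PACKAGE_EXTENSIONS in context.py.
      val packageExtensions = Seq(".zip", ".egg", ".jar")

      def accepted(paths: Seq[String]): Seq[String] =
        paths.filter(p => packageExtensions.exists(ext => p.endsWith(ext)))
    }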
On top of that, archives were not handled at all! They made it to the
driver's python path, but never made it to executors, since the mechanism
used to propagate their location (spark.submit.pyFiles) only works on
the driver side.
So, instead, ignore "spark.submit.pyFiles" and just build PYTHONPATH
correctly for both driver and executors. Individual .py files are
placed in a subdirectory of the container's local dir in the cluster,
which is then added to the python path. Archives are added directly.
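A minimal sketch of that layout (the "__pyfiles__" subdirectory name and
the helper below are illustrative, not the actual Spark internals):

    import java.io.File

    object PythonPathBuilder {
      // Individual .py files are localized into one subdirectory of the
      // container's working dir; archives are appended to the path as-is.
      def buildPythonPath(containerDir: String, pyArchives: Seq[String]): String = {
        val pyFilesDir = containerDir + File.separator + "__pyfiles__"
        (pyFilesDir +: pyArchives).mkString(File.pathSeparator)
      }
    }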
The change, as a side effect, ends up solving the symptom described
in the bug. The issue was not that the files were not being distributed,
but that they were never made visible to the python application
running under Spark.
Also included is a proper unit test for running python on YARN, which
broke in several different ways with the previous code.
A short walk-through of the changes:
- SparkSubmit does not try to be smart about how YARN handles python
files anymore. It just passes down the configs to the YARN client
code.
- The YARN client distributes python files and archives differently,
placing the files in a subdirectory.
- The YARN client now sets PYTHONPATH for the processes it launches;
to properly handle different locations, it uses YARN's support for
embedding env variables. To avoid YARN expanding those at the
wrong time, SparkConf is now propagated to the AM using a conf file
instead of command line options (see the sketch after this list).
- Because the Client initialization code is a maze of implicit
dependencies, some code needed to be moved around to make sure
all needed state was available when the code ran.
- The pyspark tests in YarnClusterSuite now actually distribute and try
to use both a python file and an archive containing a different python
module. Also added a yarn-client test for completeness.
- I cleaned up some of the code around distributing files to YARN, to
avoid adding more copied & pasted code to handle the new files being
distributed.
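Below is a hedged sketch of the conf-file idea mentioned in the
PYTHONPATH bullet: the client writes the SparkConf out as a properties
file shipped with the AM, so values that embed YARN env-variable
references are only expanded at container launch. The file name and
helper are illustrative, not Spark's actual API.

    import java.io.{File, FileOutputStream, OutputStreamWriter}
    import java.nio.charset.StandardCharsets
    import java.util.Properties

    object ConfFileWriter {
      // Serialize the configuration to a properties file instead of passing
      // it down as -D command line options, so YARN-style references
      // (e.g. {{PWD}}) survive untouched until the container is launched.
      def writeConfFile(conf: Map[String, String], stagingDir: File): File = {
        val props = new Properties()
        conf.foreach { case (k, v) => props.setProperty(k, v) }
        val confFile = new File(stagingDir, "__spark_conf__.properties")
        val out = new OutputStreamWriter(
          new FileOutputStream(confFile), StandardCharsets.UTF_8)
        try props.store(out, "Spark configuration for the AM") finally out.close()
        confFile
      }
    }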
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vanzin/spark SPARK-5479
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/6360.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6360
----
commit 943cbf450d32f49a16091247f2bf7e0679d184ae
Author: Marcelo Vanzin <[email protected]>
Date: 2015-05-21T00:29:29Z
[SPARK-5479] [yarn] Handle --py-files correctly in YARN.
----