GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/853
[SPARK-1900] Fix running PySpark files on YARN
If I run the following on a YARN cluster
```
bin/spark-submit sheep.py --master yarn-client
```
it fails because of a mismatch in paths: spark-submit assumes that
`sheep.py` resides on HDFS and balks when it cannot find the file there. A
natural workaround is to add the `file:` prefix to the path:
```
bin/spark-submit file:/path/to/sheep.py --master yarn-client
```
However, this also fails, this time because Python does not understand URI
schemes.
This PR fixes the issue by automatically and properly resolving all paths
passed as command-line arguments to spark-submit. This has the added benefit
of keeping file and jar paths consistent across different cluster modes.
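The resolution rule described above can be sketched as follows. This is a hypothetical Python helper for illustration only (the actual fix lives in spark-submit's Scala code, and `resolve_uri` is not a real Spark function): a path with an explicit scheme is left alone, while a scheme-less path is treated as a local file.

```python
import os
from urllib.parse import urlparse

def resolve_uri(path):
    """Treat scheme-less command-line paths as local files.

    Paths with an explicit scheme (hdfs:, file:, etc.) pass
    through unchanged; bare paths get a file: prefix so they
    are not mistaken for HDFS paths on YARN clusters.
    """
    if urlparse(path).scheme:
        # Caller already specified where the file lives.
        return path
    # Default to the local filesystem, resolving to an absolute path.
    return "file:" + os.path.abspath(path)
```

With this rule, `bin/spark-submit sheep.py` behaves the same on YARN as it does locally, because `sheep.py` is resolved to `file:/.../sheep.py` before being handed off, while an explicit `hdfs://...` path is untouched.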
Much of the code is written by @mengxr.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark submit-paths
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/853.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #853
----
commit 02f77f39c5f8a530c58f86dbe28e8e3507c3cfb0
Author: Andrew Or <[email protected]>
Date: 2014-05-22T08:17:08Z
Resolve command line arguments to spark-submit properly
Jars and files provided to spark-submit are treated as HDFS paths
on YARN clusters, even if they exist locally. This is inconsistent
with other modes. Instead, we should always treat command-line
argument paths passed to spark-submit as local paths, unless
otherwise specified.
----