GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/853
[SPARK-1900] Fix running PySpark files on YARN
If I run the following on a YARN cluster
```
bin/spark-submit sheep.py --master yarn-client
```
it fails because of a mismatch in paths: spark-submit assumes that
`sheep.py` resides on HDFS and balks when it cannot find the file there. A
natural workaround is to add the `file:` prefix to the path:
```
bin/spark-submit file:/path/to/sheep.py --master yarn-client
```
However, this also fails, this time because Python does not understand URI
schemes.
This PR fixes the issue by automatically and properly resolving all paths
passed as command-line arguments to spark-submit. This has the added benefit
of keeping file and jar paths consistent across different cluster modes.
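The resolution rule described above can be sketched as follows. This is a hypothetical Python helper for illustration only (the actual fix lives in spark-submit's Scala code, and `resolve_uri` is not a real Spark function): a path with an explicit scheme is left alone, while a scheme-less path is treated as a local file.

```python
import os
from urllib.parse import urlparse

def resolve_uri(path):
    """Treat scheme-less command-line paths as local files.

    Paths with an explicit scheme (hdfs:, file:, etc.) pass
    through unchanged; bare paths get a file: prefix so they
    are not mistaken for HDFS paths on YARN clusters.
    """
    if urlparse(path).scheme:
        # Caller already specified where the file lives.
        return path
    # Default to the local filesystem, resolving to an absolute path.
    return "file:" + os.path.abspath(path)
```

With this rule, `bin/spark-submit sheep.py` behaves the same on YARN as it does locally, because `sheep.py` is resolved to `file:/.../sheep.py` before being handed off, while an explicit `hdfs://...` path is untouched.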
Much of the code is written by @mengxr.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark submit-paths
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/853.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #853
----
commit 02f77f39c5f8a530c58f86dbe28e8e3507c3cfb0
Author: Andrew Or <[email protected]>
Date: 2014-05-22T08:17:08Z
Resolve command line arguments to spark-submit properly
Jars and files provided to spark-submit are treated as HDFS paths
on YARN clusters, even if they exist locally. This is inconsistent
with other modes. Instead, we should always treat command-line
argument paths passed to spark-submit as local paths, unless
otherwise specified.
----