Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/853#issuecomment-44084121
@tdas I have pushed a commit that corrects the way we set PYTHONPATH. In a
nutshell, Python does not understand URI schemes (e.g. `file:/`), but the paths
we add to PYTHONPATH do contain these prefixes (e.g. `file:/path/to/hello.py`).
Instead, we should strip the prefix and only add the actual path (e.g.
`/path/to/hello.py`).
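To illustrate the idea, here is a minimal sketch of scheme-stripping (a hypothetical helper, not the actual Spark code):

```python
from urllib.parse import urlparse

def strip_uri_scheme(path):
    """Strip a 'file:' URI scheme so the result is a plain local path.

    Hypothetical illustration of the fix described above.
    """
    parsed = urlparse(path)
    # No scheme (already a plain path) or a 'file' scheme: use the raw path.
    if parsed.scheme in ("", "file"):
        return parsed.path or path
    # Leave non-local schemes untouched; they are handled elsewhere.
    return path

# strip_uri_scheme("file:/path/to/hello.py") -> "/path/to/hello.py"
# strip_uri_scheme("/path/to/hello.py")      -> "/path/to/hello.py"
```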
Unfortunately, this involves a fairly non-trivial change, because we also
have to make sure that the provided python files exist locally, such that
adding them to the PYTHONPATH is actually meaningful.
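The existence check can be sketched like this (again a hypothetical helper for illustration, not the real implementation):

```python
import os

def add_if_local(path, pythonpath):
    """Append a path to the PYTHONPATH list only if it exists locally.

    Hypothetical sketch: adding a path that does not exist on this
    machine (e.g. a driver-side path on a YARN executor) is meaningless.
    """
    if os.path.exists(path):
        pythonpath.append(path)
    return pythonpath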
Also, we have been adding the python file itself to the PYTHONPATH. This is
incorrect and does not work on YARN; instead, we should be adding the python
file's containing directory. However, `--py-files` may also contain zip files,
in which case we still have to add the file itself to the PYTHONPATH. This is
reflected in my latest commit (in context.py).
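The directory-vs-archive rule can be sketched as follows (a hypothetical helper mirroring the logic described above, not the actual context.py code):

```python
import os

def pythonpath_entry(py_file):
    """Return what should be added to the PYTHONPATH for a --py-files entry.

    Hypothetical sketch: zip archives are importable directly, so the
    file itself goes on the path; for plain .py files, the containing
    directory must be added instead.
    """
    if py_file.endswith(".zip"):
        return py_file
    return os.path.dirname(py_file)

# pythonpath_entry("/path/to/hello.py") -> "/path/to"
# pythonpath_entry("/path/to/deps.zip") -> "/path/to/deps.zip"
```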
This is a slightly invasive change, but much of the new code is tests for
formatting the paths properly. The good news is that I have tested this
locally, on a CDH5 cluster, and on Windows, and everything behaves as expected.
More specifically, in each of these environments I ran a combination of
spark-shell, spark-submit, and pyspark, with jars / python files referencing
each other. I can confirm that `--py-files` (which was broken on YARN before
this commit) is now working.
I have not had the time to test this in standalone mode or on an HDP cluster
(especially with Hadoop 2.4). Once these have been tested, I think this PR is
ready to merge.