Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/853#issuecomment-44084121
@tdas I have pushed a commit that corrects the way we set PYTHONPATH. In a
nutshell, Python does not understand URI schemes (e.g. `file:/`), but the paths
we add to PYTHONPATH do contain these prefixes (e.g. `file:/path/to/hello.py`).
Instead, we should strip the prefix and only add the actual path (e.g.
`/path/to/hello.py`).
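To illustrate the idea, here is a minimal sketch of scheme-stripping (a hypothetical helper, not the actual Spark code):

```python
from urllib.parse import urlparse

def strip_uri_scheme(path):
    """Strip a 'file:' URI scheme so the result is a plain local path.

    Hypothetical illustration of the fix described above.
    """
    parsed = urlparse(path)
    # No scheme (already a plain path) or a 'file' scheme: use the raw path.
    if parsed.scheme in ("", "file"):
        return parsed.path or path
    # Leave non-local schemes untouched; they are handled elsewhere.
    return path

# strip_uri_scheme("file:/path/to/hello.py") -> "/path/to/hello.py"
# strip_uri_scheme("/path/to/hello.py")      -> "/path/to/hello.py"
```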
Unfortunately, this involves a fairly non-trivial change, because we also
have to make sure that the provided python files exist locally, such that
adding them to the PYTHONPATH is actually meaningful.
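The existence check can be sketched like this (again a hypothetical helper for illustration, not the real implementation):

```python
import os

def add_if_local(path, pythonpath):
    """Append a path to the PYTHONPATH list only if it exists locally.

    Hypothetical sketch: adding a path that does not exist on this
    machine (e.g. a driver-side path on a YARN executor) is meaningless.
    """
    if os.path.exists(path):
        pythonpath.append(path)
    return pythonpath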
Also, we have been adding the python file itself to the PYTHONPATH. This is
incorrect and does not work on YARN; instead, we should be adding the python
file's containing directory. However, `--py-files` may also contain zip files,
in which case we still have to add the file itself to the PYTHONPATH. This is
reflected in my latest commit (in context.py).
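The directory-vs-archive rule can be sketched as follows (a hypothetical helper mirroring the logic described above, not the actual context.py code):

```python
import os

def pythonpath_entry(py_file):
    """Return what should be added to the PYTHONPATH for a --py-files entry.

    Hypothetical sketch: zip archives are importable directly, so the
    file itself goes on the path; for plain .py files, the containing
    directory must be added instead.
    """
    if py_file.endswith(".zip"):
        return py_file
    return os.path.dirname(py_file)

# pythonpath_entry("/path/to/hello.py") -> "/path/to"
# pythonpath_entry("/path/to/deps.zip") -> "/path/to/deps.zip"
```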
This is a slightly invasive change, but much of the new code is tests for
formatting the paths properly. The good news is that I have tested this
locally, on a CDH5 cluster, and on Windows, and everything behaves as expected.
More specifically, in each of these environments I ran a combination of
spark-shell, spark-submit, and pyspark, with jars / python files referencing
each other. I can confirm that `--py-files` (which was broken on YARN before
this commit) is now working.
I have not had the time to test this in standalone mode or on an HDP cluster
(especially with Hadoop 2.4). Once these have been tested, I think this PR is
ready to merge.