Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/6360#issuecomment-109073381
  
    @vanzin This looks great. I think the approach makes a lot of sense. The 
inline suggestions I have are mainly documentation and code organization.
    
    A higher level comment I have is that there is currently no high-level 
documentation on how "Python is handled in YARN cluster mode", say vs client 
mode. To understand how it works the reader right now has to go through the 
code and look at how container envs are set up and read the comments there. I 
think it would be very useful to add a large block comment somewhere in 
`SparkSubmit` or `Client` to explain this on a higher level. Some things to 
potentially include in this block comment:
    - pyspark / py4j files are automatically zipped up and put on the 
`PYTHONPATH` of all executors
    - .py files are put in a special dir, which is then put on the `PYTHONPATH`
    - archive python files (.egg, .zip etc.) are put directly on the 
`PYTHONPATH`
    - primary python file is only distributed in cluster mode and only to the 
AM (because the driver runs there)
    - where these python files are (both primary and secondary)
    
    Another thing I was wondering is: was there a particular reason behind the 
change to how Spark properties are distributed (command line java opts -> 
`--properties-file`)? I think it's OK to keep it as is, but it would be good to 
put that in a java doc somewhere. (It's still a surprise to me how `Client` is 
a thousand line file but doesn't even have a high level java doc on the class)
    
    TL;DR I think this is a good change for master. It's pretty close to being 
merged aside from a few documentation suggestions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to