Github user tgravescs commented on the pull request:

    https://github.com/apache/spark/pull/4142#issuecomment-83018344
  
    I think this is a good change and lets us set some Hadoop configs more 
easily, but I'm a bit concerned about the extra load this could cause if you 
have a lot of files in the conf dir.  By load I mean on the namenode as well as 
just task startup time.  The files should all be relatively small, so I 
wouldn't think startup time would be affected too much, but I'm curious if you 
measured it at all?  Some Hadoop config dirs can have a lot of files in them: 
metrics property files, the normal xml files, env files, and whatever else 
happens to be thrown in there.  For example, I looked at one cluster and saw 
41 files in that directory.  This could put a bunch of unneeded load on the 
namenode, especially if you are starting thousands of executors all at once.  
MR obviously packages the confs and ships them too, but it's all in a single 
job.xml file.
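
    Just to illustrate the job.xml point, here's a rough sketch (the helper 
name is made up, and this is not what the PR does) of collapsing the whole conf 
dir into a single archive so the distributed cache and the namenode only ever 
see one file per application:

    ```scala
    import java.io.{File, FileInputStream, FileOutputStream}
    import java.util.zip.{ZipEntry, ZipOutputStream}

    // Illustrative helper: bundle every plain file in the Hadoop conf dir
    // into one archive, so only a single file has to be uploaded and
    // localized per application (the same idea as MR collapsing everything
    // into job.xml).
    object ConfBundler {
      def zipConfDir(confDir: File, out: File): File = {
        val zos = new ZipOutputStream(new FileOutputStream(out))
        try {
          confDir.listFiles().filter(_.isFile).foreach { f =>
            zos.putNextEntry(new ZipEntry(f.getName))
            val in = new FileInputStream(f)
            try {
              val buf = new Array[Byte](8192)
              var n = in.read(buf)
              while (n != -1) { zos.write(buf, 0, n); n = in.read(buf) }
            } finally {
              in.close()
            }
            zos.closeEntry()
          }
        } finally {
          zos.close()
        }
        out
      }
    }
    ```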
 
    
    If these files really aren't needed on the executors, then perhaps we 
should just distribute them to the AM.  That should be possible; you probably 
just need to add another env var/config.  We used to do this for certain files 
when we used Spark's distribution method for jars.
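
    For reference, this is roughly what I mean by AM-only distribution, using 
the plain YARN local-resource API (illustrative names and structure, not the 
actual client code):

    ```scala
    import scala.collection.JavaConverters._
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.yarn.api.records._
    import org.apache.hadoop.yarn.util.{ConverterUtils, Records}

    // Illustrative only: register an already-uploaded conf archive as a local
    // resource on the AM's ContainerLaunchContext, without also attaching it
    // to every executor container.  A real client would merge this into the
    // existing resource map instead of replacing it.
    def addAmOnlyResource(conf: Configuration,
                          amContainer: ContainerLaunchContext,
                          uploaded: Path,
                          linkName: String): Unit = {
      val status = uploaded.getFileSystem(conf).getFileStatus(uploaded)
      val resource = Records.newRecord(classOf[LocalResource])
      resource.setResource(ConverterUtils.getYarnUrlFromPath(uploaded))
      resource.setSize(status.getLen)
      resource.setTimestamp(status.getModificationTime)
      resource.setType(LocalResourceType.ARCHIVE)
      resource.setVisibility(LocalResourceVisibility.APPLICATION)
      amContainer.setLocalResources(Map(linkName -> resource).asJava)
    }
    ```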
    
    Even though I don't like adding more configs, I'm thinking it would be 
nice to have a switch to turn this off and have it use the configs already on 
the cluster like it did before.
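
    Something along these lines is what I have in mind for the switch; the 
config name below is just a placeholder, not an existing Spark setting:

    ```scala
    import org.apache.spark.SparkConf

    // Placeholder switch (this config name does not exist today): when set to
    // false, skip shipping the Hadoop conf dir with the app and rely on the
    // configuration already installed on the cluster nodes, as before.
    val sparkConf = new SparkConf()
    val distributeHadoopConf =
      sparkConf.getBoolean("spark.yarn.distributeHadoopConf", defaultValue = true)

    if (distributeHadoopConf) {
      // add the $HADOOP_CONF_DIR files to the resources shipped with the app
    } else {
      // fall back to the Hadoop config already present on each node
    }
    ```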

