Github user tgravescs commented on the pull request:
https://github.com/apache/spark/pull/4142#issuecomment-83018344
I think this is a good change and it lets us set some Hadoop configs more
easily, but I'm a bit concerned about the extra load this could cause if you have
a lot of files in the conf dir. By load I mean on the namenode as well as just task
startup time. The files should all be relatively small, so I wouldn't think the
startup time would be affected too much, but I'm curious if you measured it at
all? Some Hadoop config dirs could potentially have a lot of files in them.
There are metrics property files, the normal xml files, env files, and whatever
else happens to be thrown in there. For example, I looked at one cluster and saw
41 files in that directory. This could cause a bunch of unneeded load on the
namenode, especially if you are starting thousands of executors all at once.
MR obviously packages the confs and sends them, but it's just one job.xml file.
If these really aren't needed on the executors, then perhaps we should just
distribute them to the AM. This is possible; you probably just need to add another
env/config. We used to do this for certain files when we used Spark's
distribution method for jars.
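Just to illustrate the idea (the names here are mine, not from the patch): the
AM's container launch context would localize the shipped conf files, while the
executor launch contexts leave them out and keep using whatever HADOOP_CONF_DIR
the cluster nodes already have. Rough sketch only:

```scala
// Sketch, assuming LocalResource entries for the client-side Hadoop conf files
// have already been built. Only the AM container localizes them.
import scala.collection.JavaConverters._
import org.apache.hadoop.yarn.api.records.{ContainerLaunchContext, LocalResource}

def setContainerResources(
    ctx: ContainerLaunchContext,
    commonResources: Map[String, LocalResource],
    hadoopConfResources: Map[String, LocalResource],
    isApplicationMaster: Boolean): Unit = {
  val resources =
    if (isApplicationMaster) commonResources ++ hadoopConfResources
    else commonResources  // executors fall back to the cluster-side configs
  ctx.setLocalResources(resources.asJava)
}
```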
Even though I don't like adding more configs, I'm thinking it would be nice
to have a switch to turn this off and have it use the ones on the cluster like it
did before.
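Something along these lines, where the property name is just a placeholder and
not an existing Spark config:

```scala
// Hedged sketch of the off switch: gate the conf-dir upload behind a boolean
// setting. "spark.yarn.distributeHadoopConf" is a made-up name for illustration;
// defaulting to true keeps the behavior this PR adds.
import java.io.File
import org.apache.spark.SparkConf

def hadoopConfFilesToShip(sparkConf: SparkConf): Seq[File] = {
  if (!sparkConf.getBoolean("spark.yarn.distributeHadoopConf", true)) {
    // Turned off: rely on whatever conf is already on the cluster nodes.
    Seq.empty
  } else {
    sys.env.get("HADOOP_CONF_DIR").toSeq
      .map(new File(_))
      .filter(_.isDirectory)
      .flatMap(_.listFiles().toSeq)
      .filter(_.isFile)
  }
}
```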