Github user ash211 commented on the pull request:
https://github.com/apache/spark/pull/3130#issuecomment-65878128
I think a core disconnect here is that the Spark team expected the vast
majority of applications to use Spark through the spark-submit script. But
Matt and I (and @msgehard and probably others) are trying to connect to Spark
from an independent application that isn't launched through spark-submit.
I don't know every reason our team made this choice, but parts of it include:
- having init scripts on the service
- controlling logging in a unified way with other non-Spark applications
- controlling the application layout (binaries, logs, pidfiles, scratch space, etc.)
- controlling the application lifecycle outside of the scripts
- the ability to spin up new SparkContexts in a subprocess JVM
- not having to create an uberjar to submit to Spark, and instead keeping the application's jars in a lib/ folder

and I'm sure there are a few others I'm missing right now.
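For context, the embedding we do looks roughly like the sketch below: the application builds its own SparkConf and SparkContext in-process instead of going through spark-submit, and points setJars at whatever is in lib/ rather than shipping an uberjar. The app name, master URL, and lib/ path are placeholders for illustration, not our actual setup.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EmbeddedSparkApp {
  def main(args: Array[String]): Unit = {
    // Hypothetical layout: the application's dependency jars live under lib/
    val libJars = new java.io.File("lib")
      .listFiles()
      .filter(_.getName.endsWith(".jar"))
      .map(_.getAbsolutePath)

    val conf = new SparkConf()
      .setAppName("embedded-spark-app")   // placeholder app name
      .setMaster("spark://master:7077")   // placeholder standalone master URL
      .setJars(libJars)                   // ship individual jars instead of an uberjar

    // SparkContext created directly in the application's JVM; no spark-submit involved
    val sc = new SparkContext(conf)
    try {
      val count = sc.parallelize(1 to 1000).count()
      println(s"count = $count")
    } finally {
      sc.stop()
    }
  }
}
```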
What I think might be most valuable here is to probe more into why teams
feel the need to connect to Spark by circumventing spark-submit. It's
possible that there are use cases that the script doesn't account for, and this
alternate route is going to continue causing friction and dependency shading
requests until we figure out a resolution to the issue.
@pwendell and @mccheah does that seem reasonable?