Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/4481#issuecomment-74588769
Hey Matt, sorry, I'm still a bit confused. Basically, my concern is that
we're papering over an underlying bug by adding a configuration option,
which is something we really try to avoid.
Do you regularly have unreliable network connectivity inside your
cluster? Spark overall assumes that nodes can establish reliable TCP connections
to one another. Do you actually see TCP flows terminated from within the
network as a regular occurrence? It's very hard for me to imagine a modern
hardware cluster where that's the case.
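For reference, a minimal sketch of how you could probe for this directly. The
host name is a placeholder for a peer node in your cluster (7077 is just the
standalone master's default port); this is my illustration, not anything in
Spark itself:

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

object TcpResetProbe extends App {
  // Placeholder host: substitute a real peer node in your cluster.
  val socket = new Socket()
  socket.connect(new InetSocketAddress("worker-1.example.com", 7077), 5000)
  socket.setKeepAlive(true)
  try {
    // Block on a read; if something inside the network tears the flow
    // down, it surfaces here as an IOException (e.g. connection reset).
    socket.getInputStream.read()
  } catch {
    case e: IOException => println(s"Connection dropped: ${e.getMessage}")
  } finally {
    socket.close()
  }
}
```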
The second explanation you gave was the Akka message queue. Akka in general
should be able to process thousands of messages per second, which is _way_ more
than anyone would reasonably submit to the standalone cluster manager. It's
possible that we are blocking inside our actors in a way that severely limits
throughput. If that is the case, then we should identify and fix the bug.
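To make the throughput point concrete, here's a minimal sketch of the kind of
micro-benchmark that would show whether a blocking call inside an actor is the
bottleneck. The names are mine, not Spark's; toggling the commented sleep is
the interesting part:

```scala
import java.util.concurrent.CountDownLatch
import akka.actor.{Actor, ActorSystem, Props}

// A trivial actor that just acknowledges each message. Uncomment the
// sleep to see how a blocking call inside receive() collapses throughput.
class CountingActor(latch: CountDownLatch) extends Actor {
  def receive = {
    case _ =>
      // Thread.sleep(1)  // simulate a blocking call inside the actor
      latch.countDown()
  }
}

object ThroughputBench extends App {
  val n = 100000
  val system = ActorSystem("bench")
  val latch = new CountDownLatch(n)
  val actor = system.actorOf(Props(new CountingActor(latch)))

  val start = System.nanoTime()
  (1 to n).foreach(_ => actor ! "msg")
  latch.await()
  val secs = (System.nanoTime() - start) / 1e9
  println(f"$n messages in $secs%.2f s (${n / secs}%.0f msgs/sec)")
  system.shutdown()
}
```

Even with a 1 ms blocking call per message, throughput caps at roughly 1,000
messages per second per actor, which is exactly the kind of self-inflicted
limit we'd want to find and fix rather than hide behind a config option.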
Are you seeing specific Akka timeouts or some type of error message that
could help pin down what is happening? My guess is that there is just something
buggy about job submission, and ideally we should fix that instead of adding
more knobs to work around it.
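If it is the message queue, the usual signature in the logs is an ask timeout.
A hedged sketch of what that looks like; the actor reference and request here
are placeholders, not Spark's actual names:

```scala
import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

object SubmissionSketch {
  // Hypothetical submission RPC using the ask pattern; `masterActor`
  // and `req` stand in for whatever your client actually sends.
  def submit(masterActor: ActorRef, req: AnyRef): Any = {
    implicit val timeout = Timeout(30.seconds)
    val future = masterActor ? req
    // If the receiving actor is backed up and never replies in time,
    // this throws akka.pattern.AskTimeoutException. That stack trace
    // in your logs would strongly suggest the queue is the problem.
    Await.result(future, timeout.duration)
  }
}
```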
If you have a reproduction of this behavior, that would actually be best,
i.e. a stress test or something similar that could identify what is going on.