Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/3861#issuecomment-74207099
@tnachen Thanks for working on this. Before I dive deeper into the
implementation, there is one main open question I'd like to address. The
external shuffle service is intended to outlive individual executors, and so it
is launched independently of any Spark application. The service is what enables
dynamic allocation of resources, because it can continue to serve an executor's
shuffle files after the executor has been killed. However, in this patch the
service appears to be started inside the executor backend itself, so its fate
is necessarily tied to the application.
If I understand correctly, the Mesos slave is equivalent to the standalone
Worker in that it is long running and lives beyond the lifetime of a particular
application. If this is the case, the appropriate place to start the shuffle
service would be there instead.
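To make the lifetime argument concrete, here is a minimal plain-Java sketch of the intended separation. The class names (`WorkerDaemonSketch`, `ShuffleServiceSketch`, `ExecutorBackendSketch`) are illustrative stand-ins, not Spark's actual API: the point is only that the service is owned by the long-running daemon, so killing an application's executor does not take the service down with it.

```java
// Illustrative sketch (not Spark code): the shuffle service belongs to the
// long-running daemon (standalone Worker / Mesos slave), not to the
// per-application executor backend.
public class WorkerDaemonSketch {
    // Started once when the daemon boots; survives every application.
    static final ShuffleServiceSketch shuffleService = new ShuffleServiceSketch();

    static class ShuffleServiceSketch {
        private boolean running = false;
        void start() { running = true; }        // serve shuffle files over the network
        boolean isRunning() { return running; }
    }

    static class ExecutorBackendSketch {
        // Per-application; dynamic allocation may kill it at any time.
        void launchExecutor() { /* run tasks, write shuffle files */ }
        void kill() { /* executor dies; shuffleService keeps serving its files */ }
    }

    public static void main(String[] args) {
        shuffleService.start();                  // daemon startup, app-independent
        ExecutorBackendSketch backend = new ExecutorBackendSketch();
        backend.launchExecutor();
        backend.kill();                          // the application's executor is gone...
        System.out.println(shuffleService.isRunning()); // ...but the service is still up
    }
}
```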
Another issue is that this patch in its current state seems to conflate two
concerns: (1) dynamic allocation and (2) the external shuffle service. (1) is
what you refer to as auto-scaling on the JIRA, and it depends on (2) to work.
However, since we already check whether the shuffle service is enabled in
`ExecutorAllocationManager`, we shouldn't check it again when launching the
Mesos executor. More specifically, I don't see why we launch the executor in
two different ways depending on whether (2) is enabled. I believe a better
solution is to keep these two concerns separate and launch the executor the
same way we already do today.
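To sketch the configuration point: the two concerns map to two independent settings, and the "(1) requires (2)" validation happens exactly once. The config keys below are Spark's real keys; the validation code itself is a hypothetical illustration, assuming a single up-front check like the one `ExecutorAllocationManager` performs, after which the executor launch path never needs to branch on (2) again.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch (not Spark code): dynamic allocation (1) and the external
// shuffle service (2) are independent settings; (1) implies (2) is checked once.
public class AllocationConfSketch {
    static boolean getBool(Map<String, String> conf, String key) {
        return Boolean.parseBoolean(conf.getOrDefault(key, "false"));
    }

    // Mirrors the single existing check: dynamic allocation requires the service.
    static boolean validate(Map<String, String> conf) {
        boolean dynAlloc = getBool(conf, "spark.dynamicAllocation.enabled");
        boolean shuffleSvc = getBool(conf, "spark.shuffle.service.enabled");
        return !dynAlloc || shuffleSvc;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.dynamicAllocation.enabled", "true");
        conf.put("spark.shuffle.service.enabled", "true");
        // Once validated, the executor can be launched the same way in all cases.
        System.out.println(validate(conf));
    }
}
```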