Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/4984#issuecomment-118184097
@dragos I think this is pretty close. The one outstanding issue is still
who cleans up the shuffle files.
This patch proposes to have the driver send cleanup requests to all
shuffle services when the application exits. As I mentioned earlier, there are
no guarantees here even if we install shutdown hooks.
We cannot expect the driver to manage its own exit; that is really the
responsibility of the cluster manager. My main objection is that for other
modes like YARN and standalone (coming soon), we would end up with two separate
code paths for cleaning up the shuffle files, which adds complexity.
I understand that the external shuffle service is started independently of
Mesos, so even though the Mesos master knows when the framework exits, we still
have to pass this information on to the shuffle service. There are two
potential ways to make this work:
- (1) Have the shuffle service query the master for framework status
periodically. When the framework has exited, we clean up the corresponding
application's shuffle files. @tnachen is there enough support on the Mesos side
to make this happen?
- (2) Open a long-lived connection between the shuffle service and
each driver. When the driver's end of the connection closes, the shuffle
service knows the corresponding application has exited.
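To make option (2) concrete, here is a minimal sketch of the idea: the shuffle service blocks on a per-driver connection and treats EOF as the exit signal. All names here (`HeartbeatSketch`, `watchDriver`, `simulateExit`, the app id) are hypothetical, and a plain TCP socket stands in for whatever transport the real shuffle service would use; this is an illustration of the detection mechanism, not a proposed implementation.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.SynchronousQueue;

// Sketch of option (2): the shuffle service keeps one long-lived TCP
// connection per driver; EOF on that connection signals application exit.
public class HeartbeatSketch {

    // Blocks until the peer closes the socket (read() returns -1 on EOF),
    // then reports the application id whose shuffle files should be removed.
    static void watchDriver(Socket sock, String appId, SynchronousQueue<String> cleaned)
            throws IOException, InterruptedException {
        try (Socket s = sock) {
            while (s.getInputStream().read() != -1) {
                // ignore any bytes the driver sends; we only care about EOF
            }
        }
        cleaned.put(appId);
    }

    // Simulates a driver exiting and returns the app id the service cleaned up.
    public static String simulateExit(String appId) throws Exception {
        try (ServerSocket server = new ServerSocket(0);          // shuffle-service side
             Socket driverSide = new Socket("localhost", server.getLocalPort())) {
            Socket serviceSide = server.accept();
            SynchronousQueue<String> cleaned = new SynchronousQueue<>();
            Thread watcher = new Thread(() -> {
                try {
                    watchDriver(serviceSide, appId, cleaned);
                } catch (Exception ignored) { }
            });
            watcher.start();
            driverSide.close();   // the application exiting closes the driver end
            return cleaned.take();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("clean up shuffle files for " + simulateExit("app-0001"));
    }
}
```

The appeal of this approach is that it needs no Mesos-specific support: the exit signal is the connection teardown itself, which the OS delivers even if the driver dies without running shutdown hooks.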