Github user dragos commented on the pull request:
https://github.com/apache/spark/pull/4984#issuecomment-105246967
For easier testing, I uploaded a [pre-built binary
here](http://downloads.typesafe.com/tmp/spark-prs/spark-1.4.0-SNAPSHOT-bin-dynalloc013-fdbbc32.tgz)
The gist is the following:
- shuffle files are stored in the temporary directory
(`/tmp/blockmgr/...`), but not deleted on exit
- shuffle files are now deleted by the external shuffle service when the
application exits (this functionality was already in the external shuffle
service, but never called)
- the BlockManagerMaster keeps an ever-growing list of BlockManagerIds,
essentially remembering all executors that ever lived
- hooked the `applicationRemoved` call on BlockManager.stop (only the
driver will call this method). Now it goes through all known executors and
calls the external shuffle service telling it to remove the application and
delete shuffle files
I verified that this works in the previous scenario and that files are
correctly deleted when the application exits. I'd really appreciate some
feedback on this approach/tests.
A couple of things I would still look into
- [ ] make `applicationRemoved` async so all messages are sent in parallel
when stopping the block manager
- [ ] check that there are no duplicate messages (a certain worker may wake
up again to host a new executor, but there's only one shuffle service on that
host)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]