Github user dragos commented on a diff in the pull request:
https://github.com/apache/spark/pull/4984#discussion_r33755309
--- Diff:
core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
@@ -124,10 +124,16 @@ private[spark] class DiskBlockManager(blockManager:
BlockManager, conf: SparkCon
(blockId, getFile(blockId))
}
+ /**
+ * Create local directories for storing block data. These directories are
+ * located inside configured local directories and won't
+ * be deleted on JVM exit when using the external shuffle service.
--- End diff --
You are conflating two different issues.
1. Of course shuffle files were deleted! They were deleted as soon as an
executor got killed. That's the reason why @tnachen reported those
`FileNotFound` failures! Their parent directory is deleted on a shutdown hook
(installed by `createTempDir`). That includes all subdirectories, regardless of
the test you point to. Yes, it won't delete them on `doStop` (I was initially
fooled by that code too), but the *parent dir* (and obviously everything
underneath) *is* deleted on exit, through the shutdown hook. I verified this
using log statements, so I'm pretty sure that's the case. Could we have a
Hangout session to go through this? I have the feeling we're talking past each
other.
So, once I fixed that, I had a new problem: these files need to be
eventually cleaned up, or they would pile up indefinitely.
2. As you rightfully pointed out, the external shuffle service needs to
delete those files. In my implementation I delete them when the driver stops.
Therefore I send `applicationRemoved`, from the driver, to each Mesos slave
that (at some point) had executors running. The external shuffle service is
started externally, outside of Mesos, so Mesos does *not* know that the
application exited. The external shuffle service runs on Mesos slaves all the
time (per @pwendell's suggestion
[here](https://github.com/apache/spark/pull/3861#issuecomment-74950927)), and
is not managed by Mesos.
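The driver-side bookkeeping described in point 2 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ShuffleCleanupTracker`, `executorLaunched`, `driverStopped` and the `notifyRemoved` callback are hypothetical names, and the callback stands in for whatever actually delivers `applicationRemoved` to the shuffle service on a given slave.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark: remembers every host that ever ran
// an executor for this application, so cleanup can reach all of them later.
class ShuffleCleanupTracker(appId: String, notifyRemoved: (String, String) => Unit) {
  // Insertion-ordered set, so each host is notified once, in launch order.
  private val hosts = mutable.LinkedHashSet.empty[String]

  // Call whenever an executor is launched on `host`; duplicates are ignored.
  def executorLaunched(host: String): Unit = synchronized {
    hosts += host
  }

  // Call once when the driver stops: tell each host's shuffle service that
  // the application is gone, then forget the hosts so a second call is a no-op.
  def driverStopped(): Unit = synchronized {
    hosts.foreach(host => notifyRemoved(host, appId))
    hosts.clear()
  }
}
```

The point of keeping the set for the whole lifetime of the application is that a slave whose executors have already died still holds shuffle files, so it still needs the notification.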
We could move that code into a shutdown hook instead of the normal stop path,
but I first want to get on the same page regarding how it works now, and why
it works the way it does.
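To make point 1 concrete, here is a small standalone sketch of the mechanism, under stated assumptions: it does not use Spark's `createTempDir`, and `ShutdownHookSketch`, the `spark-sketch-` prefix, the `blockmgr-sketch/0c` path and the file name are all made up for illustration. It only shows that a hook which recursively deletes the parent directory also removes every block directory and shuffle file underneath it, whether or not any other code path deleted them first.

```scala
import java.io.File
import java.nio.file.Files

object ShutdownHookSketch {
  // Delete `file` and, if it is a directory, everything below it.
  // listFiles() returns null for non-directories, hence the Option wrapper.
  def deleteRecursively(file: File): Unit = {
    Option(file.listFiles()).foreach(_.foreach(deleteRecursively))
    file.delete()
  }

  // Create a parent directory and register a JVM shutdown hook that removes
  // it, loosely mimicking the behavior this comment attributes to createTempDir.
  def createParentWithHook(): File = {
    val parent = Files.createTempDirectory("spark-sketch-").toFile
    sys.addShutdownHook(deleteRecursively(parent))
    parent
  }

  // Create a nested "block manager" directory holding one fake shuffle file.
  def createShuffleFile(parent: File): File = {
    val blockDir = new File(parent, "blockmgr-sketch/0c")
    blockDir.mkdirs()
    val shuffleFile = new File(blockDir, "shuffle_0_0_0.data")
    Files.write(shuffleFile.toPath, Array[Byte](1, 2, 3))
    shuffleFile
  }
}
```

Calling `deleteRecursively` on the parent directly has the same effect the hook has at JVM exit, which is a quick way to check the behavior without killing a process.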