Github user dragos commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4984#discussion_r33755309
  
    --- Diff: 
core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala ---
    @@ -124,10 +124,16 @@ private[spark] class DiskBlockManager(blockManager: 
BlockManager, conf: SparkCon
         (blockId, getFile(blockId))
       }
     
    +  /**
    +   * Create local directories for storing block data. These directories are
    +   * located inside configured local directories and won't
    +   * be deleted on JVM exit when using the external shuffle service.
    --- End diff ---
    
    You are conflating two different issues.
    
    1. Of course shuffle files were deleted! They were deleted as soon as an executor got killed; that's the reason @tnachen reported those `FileNotFound` failures! Their parent directory is deleted by a shutdown hook (installed by `createTempDir`), and that includes all subdirectories, regardless of the test you point to. Yes, `doStop` won't delete them (I was initially fooled by that code too), but the *parent dir* (and obviously everything underneath it) *is* deleted on exit, through the shutdown hook. I verified this using log statements, so I'm pretty sure that's the case. Could we have a Hangout session to go through this? I have the feeling we're talking past each other.
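    To make that concrete, here is a minimal, self-contained JVM sketch (plain Java, not Spark's actual code; `deleteRecursively` only roughly mirrors what Spark's `Utils.deleteRecursively` does): a shutdown hook registered on the *parent* directory wipes every subdirectory on exit, including block files written long after the hook was installed, no matter what per-subdirectory logic runs in `doStop`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class ShutdownHookDemo {

    // Delete a directory tree bottom-up (children before parents),
    // roughly what Spark's Utils.deleteRecursively does.
    static void deleteRecursively(Path root) {
        try (Stream<Path> walk = Files.walk(root)) {
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> p.toFile().delete());
        } catch (IOException ignored) {
            // best effort, as in a shutdown hook
        }
    }

    public static void main(String[] args) throws IOException {
        Path parent = Files.createTempDirectory("spark-local-");

        // The hook is registered on the *parent* dir, so everything
        // underneath it goes away on JVM exit -- including block data
        // created long after the hook was installed.
        Runtime.getRuntime()
               .addShutdownHook(new Thread(() -> deleteRecursively(parent)));

        Path blockDir = Files.createDirectories(parent.resolve("blockmgr-0/0c"));
        Files.write(blockDir.resolve("shuffle_0_0_0.data"), new byte[] {1, 2, 3});
        System.out.println("before exit, block file exists: "
            + Files.exists(blockDir.resolve("shuffle_0_0_0.data")));
        // On normal JVM exit the hook fires and the whole tree is gone,
        // which is exactly what happens to an executor's shuffle files.
    }
}
```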
    
    So, once I fixed that, I had a new problem: these files eventually need to be cleaned up, or they will pile up indefinitely.
    
    2. As you rightfully pointed out, the external shuffle service needs to 
delete those files. In my implementation I delete them when the driver stops. 
Therefore I send `applicationRemoved`, from the driver, to each Mesos slave 
that (at some point) had executors running. The external shuffle service is 
started externally, outside of Mesos, so Mesos does *not* know that the 
application exited. The external shuffle service runs on Mesos slaves all the 
time (per @pwendell's suggestion [here](https://github.com/apache/spark/pull/3861#issuecomment-74950927)), and is not managed by Mesos.
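    A hypothetical sketch of that bookkeeping (illustrative names only; `ShuffleCleanupTracker` and `sendApplicationRemoved` are *not* Spark's actual API): the driver records every slave that ever launched an executor and, when it stops, tells each slave's shuffle service to drop the application's files, since Mesos itself will never deliver that signal:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical driver-side bookkeeping: remember which slaves ever ran
// an executor, and notify each one's external shuffle service when the
// application stops. Mesos cannot do this for us, because the shuffle
// service is started outside of Mesos.
class ShuffleCleanupTracker {

    private final Set<String> slavesWithExecutors = ConcurrentHashMap.newKeySet();

    // Called whenever an executor is launched; a set records each slave once.
    void executorLaunched(String slaveHost) {
        slavesWithExecutors.add(slaveHost);
    }

    // Called on the normal driver-shutdown path.
    void driverStopping(String appId) {
        for (String host : slavesWithExecutors) {
            sendApplicationRemoved(host, appId);
        }
    }

    // Stand-in for the real RPC to the slave's shuffle service.
    protected void sendApplicationRemoved(String host, String appId) {
        System.out.println("applicationRemoved(" + appId + ") -> " + host);
    }
}
```

Using a set means each slave is notified exactly once, even if it hosted many executors over the application's lifetime.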
    
    We could hook that code to a shutdown hook instead of the normal path, but first I want to get on the same page regarding how it works now, and why it works the way it does.

