Github user dragos commented on the pull request:

    https://github.com/apache/spark/pull/4984#issuecomment-105246967
  
    For easier testing, I uploaded a [pre-built binary 
here](http://downloads.typesafe.com/tmp/spark-prs/spark-1.4.0-SNAPSHOT-bin-dynalloc013-fdbbc32.tgz)
    
    The gist is the following:
    
    - shuffle files are stored in the temporary directory 
(`/tmp/blockmgr/...`), but not deleted on exit
    - shuffle files are now deleted by the external shuffle service when the 
application exits (this functionality was already in the external shuffle 
service, but never called)
    - the BlockManagerMaster keeps an ever-growing list of BlockManagerIds, 
essentially remembering all executors that ever lived
    - hooked the `applicationRemoved` call on BlockManager.stop (only the 
driver will call this method). Now it goes through all known executors and 
calls the external shuffle service telling it to remove the application and 
delete shuffle files
    
    I verified that this works in the previous scenario and that files are 
correctly deleted when the application exits. I'd really appreciate some 
feedback on this approach/tests.
    
    A couple of things I would still look into
    - [ ] make `applicationRemoved` async so all messages are sent in parallel 
when stopping the block manager
    - [ ] check that there are no duplicate messages (a certain worker may wake 
up again to host a new executor, but there's only one shuffle service on that 
host)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to