I apologize for the duplicate of this on yarn-dev. I realized later that this is probably more related to MR.
I am running MR with a non-HDFS file system backend (Ceph) and have noticed that some processes exit or are killed before the file system client is properly shut down (i.e. before FileSystem::close completes). We need clean shutdowns right now because they release resources that, when not cleaned up, lead to fs timeouts that slow every other client down. We've adjusted the YARN timeout controlling the delay before SIGKILL is sent to containers, which resolves the problem for containers running map tasks, but there is one instance of an unclean shutdown that I'm having trouble tracking down.

Based on the file system trace of this unknown process, it appears to be the AppMaster or some other manager process. In particular it stats all of the files related to the job, at the end removes many configuration files and the COMMIT_SUCCESS file, and finally removes the job staging directory, which seems to match the behavior of the AppMaster. So the first question is: am I actually seeing the behavior of the AppMaster (full trace is here: http://pastebin.com/SVCfRfA4)?

After that final job staging directory is removed, the fs trace is truncated, suggesting the process immediately exited or was killed. So the second question is: if this is the AppMaster, what might be causing the unclean fs shutdown, and is there a way to control this? I noticed that MRAppMaster::main contains `conf.setBoolean("fs.automatic.close", false);` but I cannot find any place where close is called on the file systems explicitly.

Thanks, Noah
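P.S. To clarify what I mean about `fs.automatic.close`: my understanding (which may be wrong) is that the setting only disables the JVM shutdown hook that would otherwise call FileSystem.closeAll() on exit, so something else would have to close the cached clients explicitly. A minimal sketch of the kind of explicit close I'd expect to find somewhere, using only the public FileSystem API (the class name here is just hypothetical, for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ExplicitFsClose {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Same setting MRAppMaster::main applies: as I understand it, this
        // disables the shutdown hook that would otherwise call
        // FileSystem.closeAll() when the JVM exits.
        conf.setBoolean("fs.automatic.close", false);

        FileSystem fs = FileSystem.get(conf);
        try {
            // ... do work against fs ...
            System.out.println(fs.getUri());
        } finally {
            // With automatic close disabled, the cached client is only
            // released if something calls close()/closeAll() explicitly.
            FileSystem.closeAll();
        }
    }
}
```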
