On 20 May 2016, at 00:34, Shankar Venkataraman <shankarvenkataraman...@gmail.com> wrote:
Thanks Luciano. The case we are seeing is different: the YARN resource manager is shutting down the container in which the executor is running, since there does not seem to be a response and it is deeming it dead. It started another container, but the driver seems to be oblivious for nearly 2 hours. I am wondering if there is a condition where the driver is not seeing the notification from the YARN RM about the executor container going away. We will try some of the settings you pointed to and see if that alleviates the issue.

Shankar

The YARN RM doesn't (AFAIK) do any liveness checks on executors.

1. The AM regularly heartbeats with the RM.
2. If that stops, the AM is killed and (unless it has requested container preservation) so are all its containers. The AM is then restarted (if retries < yarn.am.retry.count (?)).
3. Node Managers, one per server, heartbeat to the RM.
4. If they stop checking in, the RM assumes the node and all its running containers are dead, reports the failures to the AM, and leaves it to deal with them. (Special case: work-preserving NM restart.)
5. If the process running in a container fails, the NM picks it up and relays that to the AM via the RM.

Some details: http://www.slideshare.net/steve_l/yarn-services

Have a look in the NM logs to see what it thinks is happening, but I think it may well be some driver/executor communication problem.
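
To make the heartbeat model in points 1-4 concrete, here is a rough toy sketch in Python. It is not YARN code: the class, function, and node/container names are all made up for illustration, and the 10-second expiry is an arbitrary assumption. It only shows the idea that whoever stops heartbeating gets expired, and that every container on an expired node is then reported as lost.

```python
import time


class LivenessMonitor:
    """Toy model of heartbeat-based liveness tracking (hypothetical, not YARN code)."""

    def __init__(self, expiry_seconds, clock=time.monotonic):
        self.expiry_seconds = expiry_seconds
        self.clock = clock          # injectable so the model can be driven in tests
        self.last_seen = {}         # name -> time of last heartbeat

    def heartbeat(self, name):
        """Record that `name` (an AM or an NM in this model) just checked in."""
        self.last_seen[name] = self.clock()

    def expire(self):
        """Return, and stop tracking, everything silent for longer than the expiry."""
        now = self.clock()
        dead = sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.expiry_seconds
        )
        for name in dead:
            del self.last_seen[name]
        return dead


def containers_lost(dead_nodes, containers_by_node):
    """Containers that would be reported to the AM as failed when nodes expire."""
    return [c for node in dead_nodes for c in containers_by_node.get(node, [])]


if __name__ == "__main__":
    fake_now = [0.0]
    monitor = LivenessMonitor(expiry_seconds=10, clock=lambda: fake_now[0])
    monitor.heartbeat("nm1")
    monitor.heartbeat("nm2")
    fake_now[0] = 8.0
    monitor.heartbeat("nm2")        # nm2 keeps checking in, nm1 goes quiet
    fake_now[0] = 12.0
    dead = monitor.expire()
    print(dead, containers_lost(dead, {"nm1": ["c1", "c2"], "nm2": ["c3"]}))
```

Note what is missing from the model: nothing inside a container heartbeats to the monitor. In this picture an executor is only noticed when its process exits (point 5) or its whole node expires (point 4), which is why a hung but still-running executor looks more like a driver/executor communication problem.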