On 20 May 2016, at 00:34, Shankar Venkataraman 
<shankarvenkataraman...@gmail.com> wrote:

Thanks Luciano. The case we are seeing is different - the YARN resource manager 
is shutting down the container in which the executor is running since there 
does not seem to be a response and it is deeming it dead. It started another 
container but the driver seems to be oblivious for nearly 2 hours. Am wondering 
if there is a condition where the driver is not seeing the notification from 
the YARN RM about the executor container going away. We will try some of the 
settings you pointed to, and see if that alleviates the issue.

Shankar



The YARN RM doesn't (AFAIK) do any liveness checks on executors.

1. The AM regularly heartbeats with the RM.
2. If that stops, the AM is killed, along with all its containers (unless it 
has requested container preservation). The AM is then restarted (if retries < 
yarn.am.retry.count (?)).
3. Node Managers, one per server, heartbeat to the RM.
4. If they stop checking in, the RM assumes the node and all its running 
containers are dead, reports the failures to the AM, and leaves it to deal with 
them. (Special case: work-preserving NM restart.)
5. If the process running in a container fails, the NM picks it up and relays 
that to the AM via the RM.
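
To make the heartbeat side of that concrete, here is a toy model in Python of how a silent NM turns into a list of lost containers that the AM gets told about. It is only a sketch of the idea, not YARN code: the class and function names, the node and container IDs, and the 600 second expiry are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LivenessMonitor:
    """Toy heartbeat tracker: remembers when each peer last checked in."""
    expiry_seconds: float  # illustrative timeout, not a real YARN setting
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, peer: str, now: float) -> None:
        # Record the time of the most recent heartbeat from this peer.
        self.last_seen[peer] = now

    def expire(self, now: float) -> List[str]:
        # Remove and return every peer silent for longer than the expiry.
        dead = sorted(p for p, t in self.last_seen.items()
                      if now - t > self.expiry_seconds)
        for peer in dead:
            del self.last_seen[peer]
        return dead


def containers_lost(dead_nodes: List[str],
                    containers_by_node: Dict[str, List[str]]) -> List[str]:
    # Everything that was running on an expired node is treated as failed;
    # this is the list that would be reported to the AM.
    return [c for node in dead_nodes for c in containers_by_node.get(node, [])]


# Hypothetical scenario: node-2 stops heartbeating after t=0.
nm_monitor = LivenessMonitor(expiry_seconds=600.0)
nm_monitor.heartbeat("node-1", now=0.0)
nm_monitor.heartbeat("node-2", now=0.0)
nm_monitor.heartbeat("node-1", now=500.0)

dead = nm_monitor.expire(now=700.0)
lost = containers_lost(dead, {"node-1": ["c1"], "node-2": ["c2", "c3"]})
```

Note what is missing from the model: nothing here watches the executor process itself. The only liveness signals are the AM and NM heartbeats, plus the NM noticing a container process exiting.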

some details: http://www.slideshare.net/steve_l/yarn-services


Have a look in the NM logs to see what it thinks is happening, but I think it 
may well be some driver/executor communication problem.

