[ https://issues.apache.org/jira/browse/APEXCORE-743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandesh reassigned APEXCORE-743:
--------------------------------

    Assignee: Sandesh

> Killed container is shown as running
> ------------------------------------
>
>                 Key: APEXCORE-743
>                 URL: https://issues.apache.org/jira/browse/APEXCORE-743
>             Project: Apache Apex Core
>          Issue Type: Bug
>            Reporter: Sandesh
>            Assignee: Sandesh
>
> Here is the behavior
> 1. Container Heartbeat timeout happened
> 2. AppMaster sends the request to kill the container
> 3. Container is killed
> 4. AppMaster state is not updated and no new container was allocated
>
> After analyzing the code here is the possible reason
> 1. Send the kill request to NM
> 2. Container killed by NM, but NM callback doesn't happen. RecoverContainer
>    is called in NM callback, which in this case is not called.
> 3. AppMaster state is not updated
>
> Possible fix.
> Have a timeout for NM callback, so that if NM doesn't respond that the
> container is killed in time, call the RecoverContainer.
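
The proposed fix could look roughly like the sketch below. It is only an illustration of the timeout idea, not Apex code: the class `KillCallbackWatchdog`, its methods, and the `recoverContainer` hook are hypothetical names, and the real AppMaster wiring to the NM callback is assumed rather than shown. The watchdog starts a timer when the kill request is sent and runs recovery exactly once, either when the NM callback arrives or when the timer expires first.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical helper (not part of Apex): recovers a container if the NM callback is late. */
class KillCallbackWatchdog implements AutoCloseable {
  private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
  // Containers with a kill request in flight, mapped to their timeout task.
  private final Map<String, ScheduledFuture<?>> pending = new HashMap<>();
  private final Consumer<String> recoverContainer; // stand-in for the RecoverContainer logic
  private final long timeoutMillis;

  KillCallbackWatchdog(Consumer<String> recoverContainer, long timeoutMillis) {
    this.recoverContainer = recoverContainer;
    this.timeoutMillis = timeoutMillis;
  }

  /** Call right after the kill request has been sent to the NM. */
  synchronized void killRequested(String containerId) {
    if (!pending.containsKey(containerId)) {
      pending.put(containerId,
          timer.schedule(() -> finish(containerId), timeoutMillis, TimeUnit.MILLISECONDS));
    }
  }

  /** Call from the NM "container stopped" callback. */
  synchronized void killConfirmed(String containerId) {
    finish(containerId);
  }

  // Removing the entry is the gate: whichever path gets here first triggers recovery.
  private synchronized void finish(String containerId) {
    ScheduledFuture<?> timeout = pending.remove(containerId);
    if (timeout != null) {
      timeout.cancel(false); // no effect if the timeout task itself is the caller
      recoverContainer.accept(containerId);
    }
  }

  @Override
  public void close() {
    timer.shutdownNow();
  }
}
```

With this shape, a missing NM callback no longer leaves the AppMaster state stale: after `timeoutMillis` the timer path calls the recovery hook itself. A callback that arrives after the timeout is ignored, because the container is no longer in `pending`.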