[ https://issues.apache.org/jira/browse/SLIDER-1189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gour Saha updated SLIDER-1189:
------------------------------
Fix Version/s: Slider 0.92 (was: Slider 1.0.0)
> Agent never connects to new AM if AM restart takes too long
> -----------------------------------------------------------
>
> Key: SLIDER-1189
> URL: https://issues.apache.org/jira/browse/SLIDER-1189
> Project: Slider
> Issue Type: Bug
> Components: agent
> Reporter: Billie Rinaldi
> Assignee: Billie Rinaldi
> Priority: Critical
> Fix For: Slider 0.92
>
> Attachments: SLIDER-1189.1.patch, SLIDER-1189.2.patch,
> SLIDER-1189.3.patch
>
>
> While testing RM and AM failure scenarios, I killed the RM, killed the AM,
> waited for a bit, then restarted the RM. The AM is restarted, but running
> agents never connect to the new AM. The agent re-reads the AM data from the ZK
> registry once, when the heartbeat retry threshold is reached, and then tries
> to re-register with the AM. However, if the AM data is still stale at that
> point, the agent never re-reads it from the ZK registry again and retries
> registering with the nonexistent AM forever (until the new AM times it out
> for heartbeat loss and kills it).
> Note that this happens when the AM restart is delayed by more than about a
> minute, which can occur if the RM is down, or up but busy.
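
The failure mode described above can be illustrated with a small, self-contained simulation. This is only a sketch and not the attached patch or the actual agent code: the names register_with_retry, make_registry, try_register, OLD_AM and NEW_AM are hypothetical, and the ZK registry is replaced by an in-memory stub. It contrasts reading the AM address once with re-reading it before every registration retry.

```python
# Hypothetical addresses; the real agent reads AM data from the ZK registry.
OLD_AM = "old-am:1024"
NEW_AM = "new-am:2048"


class StaleAddressError(Exception):
    """Raised when registration is attempted against an AM that no longer exists."""


def make_registry(stale_lookups):
    """Return a lookup stub that yields the old AM for the first N reads."""
    calls = {"n": 0}

    def lookup():
        calls["n"] += 1
        return OLD_AM if calls["n"] <= stale_lookups else NEW_AM

    return lookup


def try_register(address):
    """Stand-in for the registration call; only the new AM accepts it."""
    if address != NEW_AM:
        raise StaleAddressError(address)


def register_with_retry(lookup_am_address, register, max_attempts=5,
                        refresh_each_attempt=True):
    """Return the address that accepted registration, or None if all attempts fail."""
    address = lookup_am_address()
    for attempt in range(max_attempts):
        # Re-read the registry before each retry so stale data can be replaced.
        if attempt > 0 and refresh_each_attempt:
            address = lookup_am_address()
        try:
            register(address)
            return address
        except StaleAddressError:
            continue
    return None


# Single read: the agent keeps retrying the stale address and never succeeds.
stuck = register_with_retry(make_registry(2), try_register,
                            refresh_each_attempt=False)
# Re-read on every retry: the agent picks up the new AM on the third lookup.
recovered = register_with_retry(make_registry(2), try_register)
```

In this sketch the single-read variant returns None because the address it read was stale, while the variant that re-reads on each retry eventually registers with the new AM. A real agent would presumably also bound or back off the registry reads, which is omitted here.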