[
https://issues.apache.org/jira/browse/SLIDER-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
kyungwan nam updated SLIDER-1221:
---------------------------------
Attachment: SLIDER-1221.patch
I'm attaching the patch.
> the way to cope against SliderAM split brain
> --------------------------------------------
>
> Key: SLIDER-1221
> URL: https://issues.apache.org/jira/browse/SLIDER-1221
> Project: Slider
> Issue Type: Bug
> Reporter: kyungwan nam
> Attachments: SLIDER-1221.patch
>
>
> I have run into a problem that could be called “Slider-AM split brain”.
> Normally, when the AM fails, the RM launches a new one.
> However, this can also happen without the AM failing, for example when there
> is a networking issue between the AM and the RM, because the RM launches a
> new AM if it receives no heartbeat from the AM for some time
> (yarn.am.liveness-monitor.expiry-interval-ms).
> In that case, the previous AM and the new AM can coexist, and the containers
> keep their connection to the previous AM.
> This can cause many problems.
> The new AM does not know about the containers launched by the previous AM.
> As a result, those containers could be killed simultaneously after some time.
> The slider-agent should register with the new SliderAM as soon as possible.
> I think this could be improved as follows.
> - The SliderAM records the time at which each heartbeat response arrives
> from the RM.
> - The SliderAM sends a “stale SliderAM” message to the slider-agent if there
> has been no AM-RM heartbeat for some time (“stale.slider.am.interval”).
> - When the slider-agent receives “stale SliderAM”, it should try to
> discover the new SliderAM and, if one is found, register with it.
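
The three steps proposed above could be sketched roughly as follows. This is only an illustration of the idea, not the contents of SLIDER-1221.patch: the class names StaleAmDetector and AgentSession, the reply strings, and the discovery callback are all hypothetical, and time is passed in explicitly so the logic can be exercised without a real RM or agent.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical AM-side helper: remembers the last RM heartbeat response. */
class StaleAmDetector {
    // Assumed reply values; the real wire format is not specified in the issue.
    static final String STALE_REPLY = "stale SliderAM";
    static final String OK_REPLY = "OK";

    private final long staleIntervalMs;       // stands in for "stale.slider.am.interval"
    private final AtomicLong lastRmResponseMs; // time of the last AM-RM heartbeat response

    StaleAmDetector(long staleIntervalMs, long nowMs) {
        if (staleIntervalMs <= 0) {
            throw new IllegalArgumentException("staleIntervalMs must be positive");
        }
        this.staleIntervalMs = staleIntervalMs;
        this.lastRmResponseMs = new AtomicLong(nowMs);
    }

    /** Step 1: record the time at which a heartbeat response arrived from the RM. */
    void onRmHeartbeatResponse(long nowMs) {
        lastRmResponseMs.set(nowMs);
    }

    /** True if no RM response has been seen for longer than the stale interval. */
    boolean isStale(long nowMs) {
        return nowMs - lastRmResponseMs.get() > staleIntervalMs;
    }

    /** Step 2: what the AM would answer to an agent heartbeat at the given time. */
    String agentHeartbeatReply(long nowMs) {
        return isStale(nowMs) ? STALE_REPLY : OK_REPLY;
    }
}

/** Hypothetical agent-side state: which AM the agent is currently registered with. */
class AgentSession {
    private String amAddress;
    // Assumed lookup of the current AM address; empty if no new AM is found yet.
    private final Supplier<Optional<String>> discovery;

    AgentSession(String initialAmAddress, Supplier<Optional<String>> discovery) {
        this.amAddress = initialAmAddress;
        this.discovery = discovery;
    }

    /** Step 3: on a stale reply, try to discover the new AM and switch to it. */
    void onHeartbeatReply(String reply) {
        if (StaleAmDetector.STALE_REPLY.equals(reply)) {
            discovery.get().ifPresent(addr -> this.amAddress = addr);
        }
    }

    String amAddress() {
        return amAddress;
    }
}
```

In this sketch the agent keeps its current AM when discovery finds nothing, so it would simply retry on the next stale reply. Choosing a stale interval shorter than yarn.am.liveness-monitor.expiry-interval-ms would presumably let agents start looking for the new AM before or around the time the RM launches it, but the right value is a tuning question the issue leaves open.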
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)