[
https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuzhao Chen updated STORM-2044:
-------------------------------
Summary: Nimbus should not make assignments crazily when Pacemaker goes
down and up (was: Nimbus should not make assignments crazily when Pacemaker
goes down)
> Nimbus should not make assignments crazily when Pacemaker goes down and up
> --------------------------------------------------------------------------
>
> Key: STORM-2044
> URL: https://issues.apache.org/jira/browse/STORM-2044
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-core
> Affects Versions: 1.0.2
> Environment: CentOS 6.5
> Reporter: Yuzhao Chen
> Labels: patch
> Fix For: 1.1.0
>
> Original Estimate: 672h
> Remaining Estimate: 672h
>
> Pacemaker is currently a stand-alone service with no HA support. When
> it goes down, all worker heartbeats are lost. Even if Pacemaker comes back
> up immediately, recovery takes a long time when there are dozens of GB of
> heartbeats. Until the worker heartbeats are completely restored, Nimbus
> treats these workers as dead because their heartbeats have timed out, and it
> keeps reassigning these "dead" workers until the heartbeats return to
> normal. As a result, many topologies are reassigned continuously during
> recovery and throughput drops sharply.
> This is not acceptable.
> I think Pacemaker is not suitable for production while this problem
> exists.
> I see several ways to solve this problem:
> 1. Pacemaker HA
> 2. When Pacemaker goes down, notify Nimbus not to make any further
> reassignments until Pacemaker recovers
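
The second option in the list above could be sketched roughly as follows. This is a minimal Java illustration, not Storm code: the class ReassignmentGuard, its methods, and the grace-period parameter are all hypothetical names, and the sketch assumes Nimbus has some way to learn when Pacemaker becomes unreachable and when it is reachable again.

```java
/**
 * Hypothetical guard that tells scheduling logic whether heartbeat
 * timeouts can currently be trusted. Not part of Storm's API.
 */
final class ReassignmentGuard {
    private final long gracePeriodMs;   // time allowed for heartbeats to repopulate
    private boolean storeUp = true;     // is the heartbeat store reachable?
    private boolean everRecovered = false;
    private long recoveredAtMs = 0L;    // timestamp of the most recent recovery

    ReassignmentGuard(long gracePeriodMs) {
        if (gracePeriodMs < 0) {
            throw new IllegalArgumentException("gracePeriodMs must be >= 0");
        }
        this.gracePeriodMs = gracePeriodMs;
    }

    /** Call when the heartbeat store (e.g. Pacemaker) becomes unreachable. */
    synchronized void onStoreDown() {
        storeUp = false;
    }

    /** Call when the heartbeat store is reachable again. */
    synchronized void onStoreUp(long nowMs) {
        storeUp = true;
        everRecovered = true;
        recoveredAtMs = nowMs;
    }

    /**
     * Returns false while the store is down, and for gracePeriodMs after
     * it comes back, so that missing heartbeats are not read as dead workers.
     */
    synchronized boolean mayReassign(long nowMs) {
        if (!storeUp) {
            return false;
        }
        if (!everRecovered) {
            return true;
        }
        return nowMs - recoveredAtMs >= gracePeriodMs;
    }
}
```

Under these assumptions, the scheduling loop would check mayReassign(System.currentTimeMillis()) before treating a heartbeat timeout as a dead worker. Choosing the grace period is the hard part: it has to cover the time workers need to republish their heartbeats, which the report above suggests can be long.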