[
https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yuzhao Chen updated STORM-2044:
-------------------------------
Description:
Pacemaker is currently a stand-alone service with no HA support. When it
goes down, all worker heartbeats are lost. Even if Pacemaker comes back up
immediately, recovery takes a long time when there are dozens of GB of
heartbeats. Until the worker heartbeats are completely restored, Nimbus
assumes these workers are dead because their heartbeats have timed out, and it
keeps reassigning these "dead" workers until heartbeats return to normal.
So, during recovery, many topologies are reassigned continuously and
throughput drops sharply.
This is not acceptable.
So I think Pacemaker is not suitable for production while this problem
exists.
I see several ways to solve this problem:
1. Pacemaker HA
2. When Pacemaker goes down, notify Nimbus not to make any more reassignments
until it recovers
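As a rough illustration of option 2, the sketch below shows one way such a
guard could look. It is not Storm code: the class ReassignmentGuard, its
methods, and the recovery grace period are all hypothetical names and
assumptions made for this example, and how Nimbus would actually learn that
Pacemaker is down is left open.

```java
/**
 * Hypothetical guard illustrating option 2. Not part of Storm's API.
 * Assumes a single scheduling thread calls these methods.
 */
final class ReassignmentGuard {
    // Assumed time allowed for workers to re-publish heartbeats after recovery.
    private final long recoveryGraceMs;
    private boolean storeUp = true;
    private boolean recoveredOnce = false;
    private long recoveredAtMs = 0L;

    ReassignmentGuard(long recoveryGraceMs) {
        this.recoveryGraceMs = recoveryGraceMs;
    }

    /** Called when the heartbeat store (Pacemaker) is detected as down. */
    void onStoreDown() {
        storeUp = false;
    }

    /** Called when the heartbeat store is reachable again. */
    void onStoreUp(long nowMs) {
        storeUp = true;
        recoveredOnce = true;
        recoveredAtMs = nowMs;
    }

    /** Returns true only if a missing heartbeat can be trusted as a real failure. */
    boolean shouldReassign(long nowMs, long lastHeartbeatMs, long timeoutMs) {
        if (!storeUp) {
            return false; // heartbeats are unavailable, so a timeout proves nothing
        }
        if (recoveredOnce && nowMs - recoveredAtMs < recoveryGraceMs) {
            return false; // heartbeats are still being restored
        }
        return nowMs - lastHeartbeatMs > timeoutMs;
    }
}
```

With a guard like this, a worker whose heartbeat looks stale is only
reassigned once the heartbeat store has been up for longer than the grace
period, which should avoid the repeated reassignments described above.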
was: Pacemaker is currently a stand-alone service without HA. When it goes
down, all worker heartbeats are lost. Even if Pacemaker comes back up
immediately, recovery takes a long time when there are dozens of GB of
heartbeats. Until the worker heartbeats are completely restored, Nimbus
assumes these workers are dead because their heartbeats have timed out, and it
keeps reassigning these "dead" workers until heartbeats return to normal. So,
during recovery, many topologies are reassigned and throughput drops
sharply.
> Nimbus should not make assignments crazily when Pacemaker goes down
> -------------------------------------------------------------------
>
> Key: STORM-2044
> URL: https://issues.apache.org/jira/browse/STORM-2044
> Project: Apache Storm
> Issue Type: Improvement
> Components: storm-core
> Affects Versions: 1.0.2
> Environment: CentOS 6.5
> Reporter: Yuzhao Chen
> Labels: patch
> Fix For: 1.1.0
>
> Original Estimate: 672h
> Remaining Estimate: 672h