[ 
https://issues.apache.org/jira/browse/STORM-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuzhao Chen updated STORM-2044:
-------------------------------
    Description: 
        Pacemaker is currently a stand-alone service with no HA support. When 
it goes down, all worker heartbeats are lost. If there are dozens of GB of 
heartbeats, recovery takes a long time even if Pacemaker comes back up 
immediately. Until the worker heartbeats are completely restored, Nimbus 
considers these workers dead because their heartbeats have timed out, and it 
keeps reassigning these "dead" workers until the heartbeats return to normal. 
So, during the recovery period, many topologies are reassigned continuously 
and throughput drops sharply.
        This is not acceptable. 
        I think Pacemaker is not suitable for production as long as the 
problem above exists.
        I see several ways to solve this problem:
        1. Pacemaker HA
        2. When Pacemaker goes down, notify Nimbus not to make any further 
reassignments until Pacemaker recovers
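        A minimal sketch of what option 2 could look like, assuming Nimbus 
is told when the heartbeat store goes down and comes back. The class 
ReassignmentGuard, its methods, and the grace period are hypothetical names 
for illustration only; they are not part of Storm's API or of any patch.

```java
// Hypothetical sketch: suppress reassignment while the heartbeat store
// (Pacemaker) is down, and for a grace period after it comes back, so that
// heartbeats can be restored before timeouts are treated as worker deaths.
class ReassignmentGuard {
    private static final long NEVER = Long.MIN_VALUE;

    private final long graceMs;                   // assumed recovery window
    private volatile boolean storeUp = true;      // is the heartbeat store reachable?
    private volatile long recoveredAtMs = NEVER;  // time of the last recovery

    ReassignmentGuard(long graceMs) {
        this.graceMs = graceMs;
    }

    // Called when the heartbeat store is detected as unreachable.
    void onStoreDown() {
        storeUp = false;
    }

    // Called when the heartbeat store becomes reachable again.
    void onStoreUp(long nowMs) {
        recoveredAtMs = nowMs;
        storeUp = true;
    }

    // True only if heartbeat timeouts may currently trigger reassignment.
    boolean mayReassign(long nowMs) {
        if (!storeUp) {
            return false;                         // store down: heartbeats unreliable
        }
        if (recoveredAtMs == NEVER) {
            return true;                          // store has never gone down
        }
        return nowMs - recoveredAtMs >= graceMs;  // wait out the grace period
    }
}
```

        The scheduling loop would check mayReassign(now) before treating a 
heartbeat timeout as a dead worker. How long the grace period should be 
depends on how fast heartbeats are restored, which is an open question here.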

  was:Pacemaker is currently a stand-alone service without HA. When it goes 
down, all worker heartbeats are lost. If there are dozens of GB of heartbeats, 
recovery takes a long time even if Pacemaker comes back up immediately. Until 
the worker heartbeats are completely restored, Nimbus considers these workers 
dead because their heartbeats have timed out, and it keeps reassigning these 
"dead" workers until the heartbeats return to normal. So, during the recovery 
period, many topologies are reassigned and throughput drops sharply.


> Nimbus should not make assignments crazily when Pacemaker goes down
> -------------------------------------------------------------------
>
>                 Key: STORM-2044
>                 URL: https://issues.apache.org/jira/browse/STORM-2044
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>    Affects Versions: 1.0.2
>         Environment: CentOS 6.5
>            Reporter: Yuzhao Chen
>              Labels: patch
>             Fix For: 1.1.0
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
>         Pacemaker is currently a stand-alone service with no HA support. 
> When it goes down, all worker heartbeats are lost. If there are dozens of GB 
> of heartbeats, recovery takes a long time even if Pacemaker comes back up 
> immediately. Until the worker heartbeats are completely restored, Nimbus 
> considers these workers dead because their heartbeats have timed out, and it 
> keeps reassigning these "dead" workers until the heartbeats return to 
> normal. So, during the recovery period, many topologies are reassigned 
> continuously and throughput drops sharply.
>         This is not acceptable. 
>         I think Pacemaker is not suitable for production as long as the 
> problem above exists.
>         I see several ways to solve this problem:
>         1. Pacemaker HA
>         2. When Pacemaker goes down, notify Nimbus not to make any further 
> reassignments until Pacemaker recovers



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
