[ 
https://issues.apache.org/jira/browse/CLOUDSTACK-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600697#comment-13600697
 ] 

ASF subversion and git services commented on CLOUDSTACK-1653:
-------------------------------------------------------------

Commit 630e75596ed6a4cf769b24900d383a05ebb25cdc in branch refs/heads/master 
from Sheng Yang <sheng.y...@citrix.com>
[ https://git-wip-us.apache.org/repos/asf?p=incubator-cloudstack.git;h=630e755 ]

CLOUDSTACK-1653: Redundant router: Fix check_heartbeat.sh malfunction caused by
delayed cron job

Under normal conditions, the interval between the timestamps in keepalived.ts
and keepalived.ts2 should be >= 60 seconds, because keepalived.ts is updated
every 10 seconds and is copied to keepalived.ts2 at intervals of at least 60
seconds.

If the interval is less than 60 seconds, then the keepalived process has failed
to update keepalived.ts every 10 seconds.

To allow for some delay in updating, check_heartbeat.sh uses 30 seconds as the
threshold for deciding whether the keepalived process is alive.
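
The comparison described above can be modeled roughly as follows. This is a
minimal Python sketch, not the shell code from the commit: the function name,
the parameter names, and the exact comparison (a gap of 30 seconds or more
counts as alive) are assumptions made for illustration.

```python
# Hypothetical model of the check; names and structure are illustrative
# and are not taken from check_heartbeat.sh.
THRESHOLD_SECONDS = 30  # value named in the commit message


def keepalived_alive(ts: int, ts2: int, threshold: int = THRESHOLD_SECONDS) -> bool:
    """Return True if keepalived.ts advanced enough since the previous check.

    ts  -- timestamp currently stored in keepalived.ts
    ts2 -- timestamp copied into keepalived.ts2 by the previous check
    """
    return (ts - ts2) >= threshold


# Healthy: keepalived kept writing and the checks ran about 60 seconds apart.
print(keepalived_alive(ts=1000, ts2=940))
# Stalled: keepalived.ts has not moved since the previous check.
print(keepalived_alive(ts=940, ts2=940))
```

A delayed cron job only makes the gap larger, so under this model a late check
is never mistaken for a dead keepalived process.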

                
> Redundant router: check_heartbeat.sh malfunction caused by delayed cron job
> ---------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-1653
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-1653
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the 
> default.) 
>    Affects Versions: 4.1.0
>            Reporter: Sheng Yang
>            Assignee: Sheng Yang
>             Fix For: 4.1.0
>
>
> According to https://bugzilla.redhat.com/show_bug.cgi?id=159441,
> cron can only guarantee the minimum interval between job executions, so two
> consecutive runs of check_heartbeat.sh may be more than 1 minute apart.
> Since keepalived should update keepalived.ts every 10 seconds, the check
> should fail if the gap seen between any two executions is less than 60
> seconds.
> The current logic in check_heartbeat.sh is wrong: it only guarantees that
> cron was not delayed, not that keepalived is alive.
> It passed the original test because that was an NFS disconnection test, in
> which the disk is corrupted, so cron is delayed, which means the network is
> down.
> Changing the condition to less than 60 (30 is probably safer, because cron
> sometimes seems to have a bug where it does not meet the minimum interval
> requirement) should work too, because the check should find that keepalived
> is dead the second time it runs after NFS recovers.
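
The claim that a dead keepalived is found by the second run can be checked with
a small simulation. This is a hedged Python sketch under an assumed model:
keepalived writes the current time to keepalived.ts every 10 seconds until it
dies, and each cron run compares keepalived.ts with keepalived.ts2 and then
copies the former into the latter. The function `simulate`, its parameters, and
the sample times are hypothetical and do not come from the script.

```python
def simulate(cron_times, died_at, update_every=10, threshold=30):
    """Replay cron runs and return (run_time, gap, verdict) for each check.

    Assumed model: keepalived writes to keepalived.ts every `update_every`
    seconds until `died_at`; each run compares keepalived.ts with
    keepalived.ts2 and then copies keepalived.ts into keepalived.ts2.
    """
    results = []
    ts2 = None  # keepalived.ts2 does not exist before the first run
    for now in cron_times:
        last_alive = min(now, died_at)
        # Time of the last write to keepalived.ts at or before `last_alive`.
        ts = (last_alive // update_every) * update_every
        if ts2 is not None:
            gap = ts - ts2
            verdict = "alive" if gap >= threshold else "dead"
            results.append((now, gap, verdict))
        ts2 = ts  # copy keepalived.ts to keepalived.ts2
    return results


# Cron runs at irregular intervals of at least 60 seconds; keepalived dies at t=155.
for run_time, gap, verdict in simulate([0, 60, 125, 185, 250], died_at=155):
    print(run_time, gap, verdict)
```

In this model the first run after the failure (t=185) can still pass, because
keepalived.ts advanced before the process died. The following run (t=250) sees
a gap of 0 and reports the process as dead.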

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
