Sheng Yang created CLOUDSTACK-1653:
--------------------------------------

             Summary: Redundant router: check_heartbeat.sh malfunction caused by delayed cron job
                 Key: CLOUDSTACK-1653
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-1653
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
    Affects Versions: 4.1.0
            Reporter: Sheng Yang
            Assignee: Sheng Yang
             Fix For: 4.1.0


According to: https://bugzilla.redhat.com/show_bug.cgi?id=159441

cron only guarantees the minimum interval between job executions, so two 
consecutive runs of check_heartbeat.sh may be more than 1 minute apart.

keepalived should update keepalived.ts every 10 seconds, so if the timestamp 
advances by less than 60 seconds between two executions, the check should fail.

The current logic in check_heartbeat.sh is wrong: it only guarantees that cron 
was not delayed, not that keepalived is alive.

This passed the original test because it was an NFS disconnection test, in 
which the disk is corrupted, so cron is delayed, which is taken to mean the 
network is down.

Changing the condition to less than 60 should work too (30 is probably safer, 
because cron sometimes seems to have a bug where it does not meet the minimum 
interval requirement), because the script should find that keepalived is dead 
the second time it is executed after NFS recovers.
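
A minimal sketch of the proposed condition, assuming the script keeps the 
keepalived.ts value from its previous run and compares it with the current 
one. The function name, the literal timestamps and the 30-second default are 
illustrative only and are not taken from the actual check_heartbeat.sh:

```shell
#!/bin/bash
# Sketch only: function name, arguments and default threshold are assumptions.

# Returns 0 if the keepalived timestamp advanced by at least $3 seconds
# (default 30) between the previous check ($1) and the current check ($2),
# and 1 otherwise (keepalived is treated as stale).
keepalived_advanced() {
    local lasttime="$1"
    local thistime="$2"
    local min_advance="${3:-30}"
    local diff=$(( thistime - lasttime ))
    [ "$diff" -ge "$min_advance" ]
}

# Example with literal timestamps (seconds since the epoch).
if keepalived_advanced 1000 1060; then
    echo "keepalived alive"
else
    echo "keepalived stale"
fi
```

Because cron runs the check at least 60 seconds apart and keepalived should 
write a new timestamp every 10 seconds, a healthy keepalived always advances 
the timestamp by well over the threshold, while a dead one leaves it unchanged 
regardless of how late cron fires.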

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira