Sheng Yang created CLOUDSTACK-1653:
--------------------------------------

             Summary: Redundant router: check_heartbeat.sh malfunction caused by delayed cron job
                 Key: CLOUDSTACK-1653
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-1653
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
    Affects Versions: 4.1.0
            Reporter: Sheng Yang
            Assignee: Sheng Yang
             Fix For: 4.1.0

According to https://bugzilla.redhat.com/show_bug.cgi?id=159441, cron only guarantees a minimum interval between job executions, so two consecutive runs of check_heartbeat.sh may be more than 1 minute apart.

keepalived should update keepalived.ts every 10 seconds, so if the timestamps seen by any two consecutive executions are less than 60 seconds apart, the check should fail. The current logic in check_heartbeat.sh is wrong: it only verifies that cron was not delayed, not that keepalived is alive.

This passed the original test because that was an NFS disconnection test, in which the disk is corrupted, so cron is delayed, which means the network is down.

Changing the condition to "less than 60" should work too (30 is probably safer, because cron sometimes seems to have a bug where it does not meet the minimum interval requirement). The script should then find out that keepalived is dead the second time it is executed after NFS recovers.
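
The proposed condition could be sketched roughly as follows. This is only an illustration, not the actual check_heartbeat.sh: the function names, the file arguments, and the idea of saving the previously seen timestamp in a second file are assumptions made for the example. The threshold of 30 seconds is the "safer" value suggested above.

```shell
#!/bin/bash
# Sketch only: names and file layout are hypothetical, not taken from
# the real check_heartbeat.sh.

# Succeeds if the keepalived timestamp advanced by at least the
# threshold (default 30 seconds) between the previous run ($1) and
# the current run ($2).
heartbeat_advanced() {
    local last=$1 this=$2 threshold=${3:-30}
    local diff=$(( this - last ))
    [ "$diff" -ge "$threshold" ]
}

# Meant to be run from cron. $1 is the file keepalived updates,
# $2 is where this check remembers the value it saw on its last run.
check_heartbeat() {
    local ts_file=$1 last_file=$2 this last status=0
    this=$(cat "$ts_file")
    if [ -e "$last_file" ]; then
        last=$(cat "$last_file")
        if ! heartbeat_advanced "$last" "$this" 30; then
            echo "keepalived.ts advanced only $(( this - last ))s; keepalived looks dead"
            status=1
        fi
    fi
    # Remember what this run saw so the next run can compare against it.
    echo "$this" > "$last_file"
    return $status
}
```

Because the comparison fails on a gap that is too small rather than too large, a delayed cron run alone does not trigger a failure; only a timestamp that has stopped advancing does.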