On 4/13/07, Piotr Kaczmarzyk <[EMAIL PROTECTED]> wrote:
Hi,

I'm using version 2.0.8 and I tried to provide a highly available squid service. I wrote my own OCF script, which I tested in two versions:

ver 1. The 'start' function started squid, waited a few seconds, then tried to connect to port 8080, issued an HTTP request, and returned either $OCF_SUCCESS or $OCF_ERR_GENERIC depending on the result. The 'monitor' function performed a similar check.

ver 2. The 'start' function always returned $OCF_SUCCESS (only the 'monitor' function checked squid's status).

In the first case, when squid failed to start it was moved to another node (which is what I expected), but I was unable to move it back to the original node (using crm_standby, for example). It was somehow permanently marked as 'unable to start on node 1'. I tried to clear the failcount using crm_failcount, but that did not help. After disabling the second node (or putting it into standby mode, or squid failing to start there), squid stopped running anywhere. How do I get the node to accept that resource again in such a situation? (Restarting the node helps :| )
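For illustration, the kind of check described in "ver 1" could look roughly like this. This is a minimal sketch, not Piotr's actual script; the config path, port, and function names are assumptions, and the probe only tests that the TCP port accepts connections rather than issuing a full HTTP request:

```shell
#!/bin/sh
# Sketch of an OCF-style start/monitor pair for squid -- illustrative only.
# Standard OCF exit codes:
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# probe_squid HOST PORT -- succeed only if a TCP connection can be opened.
# Uses the shell's /dev/tcp redirection; where nc(1) is available,
# "nc -z HOST PORT" would serve the same purpose.
probe_squid() {
    ( exec 3<>"/dev/tcp/$1/$2" ) 2>/dev/null && return $OCF_SUCCESS
    return $OCF_ERR_GENERIC
}

squid_start() {
    squid -f /etc/squid/squid.conf   # hypothetical config path
    sleep 5                          # give squid time to open its port
    probe_squid 127.0.0.1 8080       # assumed listen address/port
}

squid_monitor() {
    probe_squid 127.0.0.1 8080
}
```

A full resource agent would also implement 'stop', 'meta-data', and 'validate-all' actions, but the start-time probe above is the part relevant to this thread.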
http://www.linux-ha.org/v2/faq/manual_recovery
When I tested the second version of the script, the resource that failed to start stayed on the node the whole time and the failcount increased. By changing the "default-resource-stickiness" and "default-resource-failure-stickiness" parameters I was able to force it to move to another node after a number of failures. Is this the right way to deal with such cases?
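For reference, those two cluster-wide defaults can be set with crm_attribute; the values below are made-up examples, not recommendations:

```shell
# default-resource-stickiness: score for staying where the resource is.
# default-resource-failure-stickiness: (negative) score added per failure.
crm_attribute -t crm_config -n default-resource-stickiness -v 100
crm_attribute -t crm_config -n default-resource-failure-stickiness -v -50
```

With these example numbers, each failure subtracts 50 from the node's score, so the resource's preference for its current node is exhausted after a couple of failures and it migrates.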
sure
That requires squid to be restarted a few times before it is moved to another node, so it takes some time. Checking during startup seems more appropriate for the case where squid is unable to start at all.

Moreover, if the failcount accumulates over time (for example, squid on one node restarts every few days for some reason), the resource will eventually be moved to another node after enough failures, while IMHO it should stay on the current node. One solution that comes to mind is to reset the failcount every day at 3 AM, for example, but if there is a serious problem (i.e. squid is unable to start on one node at all) that will cause a few minutes of downtime (until the failcount rises again). Or perhaps I should leave things as they are, because occasional service restarts require user intervention anyway?
Eventually we hope to have failures time out, but for now you need to manually decrease or remove the failcount setting in whatever way best suits your needs.
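A manual reset can be done with crm_failcount; the resource and node names below are placeholders:

```shell
# Inspect the current failcount for resource "squid" on node "node1".
crm_failcount -G -U node1 -r squid

# Delete it, so the policy engine will consider node1 for the resource again.
crm_failcount -D -U node1 -r squid
```

The delete could be run from cron if you really want the "reset every night" behaviour discussed above.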
How can I notify the administrator that a certain service failed to start a few times and was moved to another node due to a high failcount (which means the first node needs some maintenance)? Should I use additional software like 'mon'?
You can put the resource in a group with the MailTo agent to be notified whenever the group moves.
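A sketch of what that group could look like in the Heartbeat 2 CIB; the ids and the address are placeholders, and the squid primitive stands in for whatever agent you actually use (ocf:heartbeat:MailTo takes 'email' and 'subject' parameters):

```xml
<group id="grp_squid">
  <primitive id="rsc_squid" class="ocf" provider="custom" type="squid"/>
  <primitive id="rsc_mailto" class="ocf" provider="heartbeat" type="MailTo">
    <instance_attributes id="rsc_mailto_ia">
      <attributes>
        <nvpair id="rsc_mailto_email" name="email" value="[email protected]"/>
        <nvpair id="rsc_mailto_subject" name="subject" value="squid group moved"/>
      </attributes>
    </instance_attributes>
  </primitive>
</group>
```

Because group members start in order, the MailTo agent's start/stop actions fire whenever the group starts or stops on a node, which is what produces the notification mail.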
Best regards,
Piotr
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
