On 4/13/07, Piotr Kaczmarzyk <[EMAIL PROTECTED]> wrote:
Hi,

I'm using version 2.0.8 and I tried to provide a highly available squid service. I wrote my own OCF script, which I tested in two versions:

ver 1. The 'start' function started squid, waited a few seconds, then tried to connect to port 8080, issued an HTTP request, and returned either $OCF_SUCCESS or $OCF_ERR_GENERIC depending on the result. The 'monitor' function performed a similar check.

ver 2. The 'start' function always returned $OCF_SUCCESS (only the 'monitor' function checked squid's status).

In the first case, when squid failed to start it was moved to another node (which is what I expected), but I was unable to move it back to the original node (using crm_standby, for example). It was somehow permanently marked as 'unable to start on node 1'. I tried to clear the failcount using crm_failcount, but that did not help. After disabling the second node (or putting it into standby mode, or squid failing to start there), squid stopped running anywhere. How do I get the node to accept that resource again in such a situation? (Restarting the node helps :| )
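For illustration, the kind of check described in "ver 1" could look roughly like this. This is a minimal sketch, not Piotr's actual script; the config path, port, and function names are assumptions, and the probe only tests that the TCP port accepts connections rather than issuing a full HTTP request:

```shell
#!/bin/sh
# Sketch of an OCF-style start/monitor pair for squid -- illustrative only.
# Standard OCF exit codes:
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# probe_squid HOST PORT -- succeed only if a TCP connection can be opened.
# Uses the shell's /dev/tcp redirection; where nc(1) is available,
# "nc -z HOST PORT" would serve the same purpose.
probe_squid() {
    ( exec 3<>"/dev/tcp/$1/$2" ) 2>/dev/null && return $OCF_SUCCESS
    return $OCF_ERR_GENERIC
}

squid_start() {
    squid -f /etc/squid/squid.conf   # hypothetical config path
    sleep 5                          # give squid time to open its port
    probe_squid 127.0.0.1 8080       # assumed listen address/port
}

squid_monitor() {
    probe_squid 127.0.0.1 8080
}
```

A full resource agent would also implement 'stop', 'meta-data', and 'validate-all' actions, but the start-time probe above is the part relevant to this thread.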
http://www.linux-ha.org/v2/faq/manual_recovery
When I tested the second version of the script, the resource that failed to start stayed on the node the whole time and the failcount increased. By changing the "default-resource-stickiness" and "default-resource-failure-stickiness" parameters I was able to force it to move to another node after a number of failures. Is this the right way to deal with such cases?
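For reference, those two cluster-wide defaults can be set with crm_attribute; the values below are made-up examples, not recommendations:

```shell
# default-resource-stickiness: score for staying where the resource is.
# default-resource-failure-stickiness: (negative) score added per failure.
crm_attribute -t crm_config -n default-resource-stickiness -v 100
crm_attribute -t crm_config -n default-resource-failure-stickiness -v -50
```

With these example numbers, each failure subtracts 50 from the node's score, so the resource's preference for its current node is exhausted after a couple of failures and it migrates.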
sure
That requires squid to be restarted a few times before it is moved to another node, so it takes some time. Checking during startup seems more appropriate for the case where squid is unable to start at all.

Moreover, if the failcount accumulates over time (for example, squid on one node restarts every few days for some reason), the resource will eventually be moved to another node after enough failures, while IMHO it should stay on the current node. One solution that comes to mind is to reset the failcount every day at 3 AM, for example, but if there is a serious problem (i.e. squid is unable to start on one node at all) that will cause a few minutes of downtime (until the failcount rises again). Or perhaps I should leave things as they are, because occasional service restarts require user intervention anyway?
Eventually we hope to have failures time out, but for now you need to manually decrease or remove the failcount setting in whatever way best suits your needs.
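A manual reset can be done with crm_failcount; the resource and node names below are placeholders:

```shell
# Inspect the current failcount for resource "squid" on node "node1".
crm_failcount -G -U node1 -r squid

# Delete it, so the policy engine will consider node1 for the resource again.
crm_failcount -D -U node1 -r squid
```

The delete could be run from cron if you really want the "reset every night" behaviour discussed above.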
How can I notify the administrator that a certain service failed to start a few times and was moved to another node due to a high failcount (which means the first node needs some maintenance)? Should I use additional software like 'mon'?
You can put the resource in a group with the MailTo agent to be notified whenever the group moves.
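A sketch of what that group could look like in the Heartbeat 2 CIB; the ids and the address are placeholders, and the squid primitive stands in for whatever agent you actually use (ocf:heartbeat:MailTo takes 'email' and 'subject' parameters):

```xml
<group id="grp_squid">
  <primitive id="rsc_squid" class="ocf" provider="custom" type="squid"/>
  <primitive id="rsc_mailto" class="ocf" provider="heartbeat" type="MailTo">
    <instance_attributes id="rsc_mailto_ia">
      <attributes>
        <nvpair id="rsc_mailto_email" name="email" value="[email protected]"/>
        <nvpair id="rsc_mailto_subject" name="subject" value="squid group moved"/>
      </attributes>
    </instance_attributes>
  </primitive>
</group>
```

Because group members start in order, the MailTo agent's start/stop actions fire whenever the group starts or stops on a node, which is what produces the notification mail.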
Best regards,
Piotr
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
