Re: [Linux-HA] one dead ping node caused partially group restart

Andreas Kurz Thu, 30 Aug 2007 09:45:44 -0700

On 8/24/07, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> On 8/21/07, Andreas Kurz <[EMAIL PROTECTED]> wrote:
> > Hello all,
> >
> > I have a Heartbeat 2.1.2 two-node cluster installation using three
> > ping nodes to check network connectivity. Today one of the ping nodes
> > was restarted and the loss of one ping node was detected by both nodes
> > at the same time (according to the logs).
> >
> > The problem was that the pingd score_attribute was only decreased for
> > one node (holding two resource groups) in a first attempt and
> > heartbeat began to stop resources to migrate the groups away. During
> > the resource shutdown the pingd score_attribute of the second node was
> > also decreased and the resource migration was stopped and restarted on
> > the current node. Some seconds later the third ping node was up again,
> > the pingd score_attribute was updated for both nodes and the
> > resources were untouched.
> >
> > My question is: Is there a way to 'tune' the configuration to avoid
> > such resource restarts and why was the pingd score_attribute updated
> > 'simultaneously' when the ping node came up again but not when it got
> > down?
>
> if you're using the RA, increase the value of "dampen"
>
> if an event happens, we'll wait 'dampen' seconds (or milliseconds) to
> see if another one occurs on another node - so that we can update the
> CIB with both of them at the same time.


According to my logs, the pingds on both nodes recognized the loss of
one ping node at the same time. But immediately after the CIB was
updated with the first pingd attribute, heartbeat started to stop
the resources. The second pingd attribute update happened a second
later, and the already stopped resources were started again on the same
host. Was this some sort of race condition? Should heartbeat
maybe wait one additional ping interval for pingd attribute updates
before starting the recalculation of the scores, in case one node is a
little bit late when sending its update? Or does this make no
sense?

..
attrd[7171]: 2007/08/21_10:09:55 info: attrd_perform_update: Sent
update 16: pingd=2000
tengine[932]: 2007/08/21_10:09:55 info: extract_event: Aborting on
transient_attributes changes for 738e0605-7e82-47b8-b21a-e69b733eb98b
...
tengine[932]: 2007/08/21_10:09:56 info: extract_event: Aborting on
transient_attributes changes for dceade77-b3bf-40c7-a4b6-cc8995133aa1
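
To make the timing in the excerpt above concrete, here is a minimal Python
sketch of the behaviour I am asking about. It is not Heartbeat code: the
function name and the "settle window" parameter are hypothetical and only
model the idea of waiting for late attribute updates before the scores are
recalculated. The two timestamps stand for the 10:09:55 and 10:09:56
updates, expressed as seconds.

```python
from typing import List, Optional


def count_recalculations(update_times: List[float], settle_window: float) -> int:
    """Count how many score recalculations a series of attribute updates triggers.

    Hypothetical model: an update that arrives no more than `settle_window`
    seconds after the previous update is folded into the same recalculation;
    a larger gap starts a new one.
    """
    recalculations = 0
    last_time: Optional[float] = None
    for t in sorted(update_times):
        if last_time is None or t - last_time > settle_window:
            recalculations += 1
        last_time = t
    return recalculations


# The two pingd updates from the log arrive one second apart.
updates = [0.0, 1.0]

# No waiting: each update triggers its own recalculation (stop, then restart).
print(count_recalculations(updates, settle_window=0.0))

# Waiting longer than the gap between the nodes: a single recalculation.
print(count_recalculations(updates, settle_window=2.0))
```

Under this assumption, any wait longer than the gap between the two nodes'
updates would have produced a single recalculation in which both nodes
already carry the same pingd value, so nothing would have been stopped.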

Regards,
Andreas

>
> >
> > Regards,
> > Andreas
> >
> > _______________________________________________
> > Linux-HA mailing list
> > [email protected]
> > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > See also: http://linux-ha.org/ReportingProblems
> >
> >
