On 8/30/07, Andreas Kurz <[EMAIL PROTECTED]> wrote:
> On 8/24/07, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> > On 8/21/07, Andreas Kurz <[EMAIL PROTECTED]> wrote:
> > > Hello all,
> > >
> > > I have a Heartbeat 2.1.2 two-node cluster installation using three
> > > ping nodes to check network connectivity. Today one of the ping nodes
> > > was restarted and the loss of one ping node was detected by both nodes
> > > at the same time (according to the logs).
> > >
> > > The problem was that the pingd score_attribute was only decreased for
> > > one node (holding two resource groups) in a first attempt and
> > > heartbeat began to stop resources to migrate the groups away. During
> > > the resource shutdown the pingd score_attribute of the second node was
> > > also decreased and the resource migration was stopped and restarted on
> > > the current node. Some seconds later the third ping node was up again,
> > > the  pingd score_attribute was updated for both nodes and the
> > > resources were untouched.
> > >
> > > My question is: Is there a way to 'tune' the configuration to avoid
> > > such resource restarts, and why was the pingd score_attribute updated
> > > 'simultaneously' when the ping node came up again but not when it
> > > went down?
> >
> > if you're using the RA, increase the value of "dampen"
> >
> > if an event happens, we'll wait 'dampen' seconds (or milliseconds) to
> > see if another one occurs on another node - so that we can update the
> > CIB with both of them at the same time.
>
> According to my logs the pingds on both nodes recognized the loss of
> one ping node at the same time, but immediately after the CIB was
> updated with the first pingd attribute, heartbeat started to stop
> the resources; the second pingd-attribute update happened a second
> later and the already-stopped resources were started again on the same
> host .... was this some sort of race condition? Should heartbeat
> maybe wait one additional ping interval for pingd-attribute updates
> before starting the recalculation of the scores, in case one node is a
> little late sending its updates .... or does this make no
> sense?

in the current design of attrd, there is a small chance that the node
that triggers the update will get its change in a little too quickly,
so the two updates don't arrive close enough together to be written to
the CIB in one pass.  that's basically what happened here.
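the batching that 'dampen' is supposed to provide can be sketched like
this (a toy model only - the class and names are illustrative, not the
real attrd API; the real daemon works on CIB transient attributes):

```python
import threading
import time

class AttrWriter:
    """Toy model of attrd's 'dampen' batching: hold attribute changes
    for `dampen` seconds so that near-simultaneous updates from several
    nodes land in the CIB in a single write.  Purely illustrative."""

    def __init__(self, dampen):
        self.dampen = dampen
        self.pending = {}   # node -> value, changes accumulated so far
        self.writes = []    # each entry represents one CIB update
        self.timer = None
        self.lock = threading.Lock()

    def update(self, node, value):
        with self.lock:
            self.pending[node] = value
            if self.timer is None:  # first change starts the dampen timer
                self.timer = threading.Timer(self.dampen, self._flush)
                self.timer.start()

    def _flush(self):
        with self.lock:
            # everything collected during the window goes out together
            self.writes.append(dict(self.pending))
            self.pending.clear()
            self.timer = None

w = AttrWriter(dampen=0.2)
w.update("node1", 1000)   # node1 notices the ping node is gone
time.sleep(0.05)
w.update("node2", 1000)   # node2 notices 50 ms later, inside the window
time.sleep(0.3)           # let the timer fire
# w.writes now holds a single combined update for both nodes
```

the failure in this thread corresponds to the window being too short:
if node2's change arrives after the flush, the CIB sees two separate
writes and the policy engine reacts to the first one alone.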

the real solution is to have all peers send their changes to a single
node that performs the update - that way the updates are guaranteed to
be truly atomic.

we know what we need to do; it's just a matter of finding the time to do it...
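for reference, raising 'dampen' as suggested earlier in the thread
might look roughly like this in the CIB (a sketch only - the ids and
the 5s value are illustrative, and the RA provider may differ between
versions):

```xml
<clone id="pingd_clone">
  <primitive id="pingd_child" class="ocf" provider="heartbeat" type="pingd">
    <instance_attributes id="pingd_child_attrs">
      <attributes>
        <!-- wait this long for further changes before updating the CIB -->
        <nvpair id="pingd_dampen" name="dampen" value="5s"/>
        <!-- score contribution per reachable ping node -->
        <nvpair id="pingd_multiplier" name="multiplier" value="1000"/>
      </attributes>
    </instance_attributes>
  </primitive>
</clone>
```

with multiplier=1000 and three ping nodes, losing one takes the
attribute from 3000 to 2000, which matches the pingd=2000 seen in the
logs below.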

> ..
> attrd[7171]: 2007/08/21_10:09:55 info: attrd_perform_update: Sent
> update 16: pingd=2000
> tengine[932]: 2007/08/21_10:09:55 info: extract_event: Aborting on
> transient_attributes changes for 738e0605-7e82-47b8-b21a-e69b733eb98b
> ...
> tengine[932]: 2007/08/21_10:09:56 info: extract_event: Aborting on
> transient_attributes changes for dceade77-b3bf-40c7-a4b6-cc8995133aa1
>
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems