On 8/24/07, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> On 8/21/07, Andreas Kurz <[EMAIL PROTECTED]> wrote:
> > Hello all,
> >
> > I have a Heartbeat 2.1.2 two-node cluster installation using three
> > ping nodes to check network connectivity. Today one of the ping nodes
> > was restarted and the loss of one ping node was detected by both nodes
> > at the same time (according to the logs).
> >
> > The problem was that the pingd score_attribute was only decreased for
> > one node (holding two resource groups) in a first attempt and
> > heartbeat began to stop resources to migrate the groups away. During
> > the resource shutdown the pingd score_attribute of the second node was
> > also decreased and the resource migration was stopped and restarted on
> > the current node. Some seconds later the third ping node was up again,
> > the pingd score_attribute was updated for both nodes and the
> > resources were untouched.
> >
> > My question is: Is there a way to 'tune' the configuration to avoid
> > such resource restarts and why was the pingd score_attribute updated
> > 'simultaneously' when the ping node came up again but not when it got
> > down?
>
> if you're using the RA, increase the value of "dampen"
>
> if an event happens, we'll wait 'dampen' seconds (or milliseconds) to
> see if another one occurs on another node - so that we can update the
> CIB with both of them at the same time.

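If I read the explanation above correctly, the dampening can be pictured roughly like the toy model below. This is only a sketch of the behaviour as described, not pingd or attrd code: the function name coalesce_updates, the node names and the timings are made up for illustration, and I am assuming the delay starts at the first event and that everything arriving before it expires goes into one update.

```python
def coalesce_updates(events, dampen):
    """Toy model of dampening (NOT Heartbeat code).

    events: list of (time_in_seconds, node_name, pingd_value) tuples.
    dampen: how long to wait after the first event before flushing.
    Returns a list of (flush_time, {node_name: pingd_value}) batches;
    each batch stands for one combined update.
    """
    batches = []
    pending = {}
    deadline = None
    for time, node, value in sorted(events):
        # The wait for the open batch has expired: flush it first.
        if pending and time >= deadline:
            batches.append((deadline, pending))
            pending = {}
        # The first event of a new batch starts the dampen timer.
        if not pending:
            deadline = time + dampen
        pending[node] = value
    if pending:
        batches.append((deadline, pending))
    return batches


# Hypothetical timings: the second node reports one second after the first.
events = [(0, "node-a", 2000), (1, "node-b", 2000)]

print(len(coalesce_updates(events, 0)))  # no dampening: two separate updates
print(len(coalesce_updates(events, 5)))  # 5 second dampen: one combined update
```

In this model a dampen value larger than the gap between the two reports gives a single update, so the scores would only be recalculated once.
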
According to my logs, the pingds on both nodes recognized the loss of
one ping node at the same time. But immediately after the CIB was
updated once with the first pingd attribute, heartbeat started to stop
the resources. The second pingd attribute update happened a second
later, and the already stopped resources were started again on the same
host. Was this some sort of race condition?

Should heartbeat maybe wait one additional ping interval for pingd
attribute updates before starting the recalculation of the scores, in
case one node is a little late sending its update? Or does this make
no sense?

..
attrd[7171]: 2007/08/21_10:09:55 info: attrd_perform_update: Sent update 16: pingd=2000
tengine[932]: 2007/08/21_10:09:55 info: extract_event: Aborting on transient_attributes changes for 738e0605-7e82-47b8-b21a-e69b733eb98b
...
tengine[932]: 2007/08/21_10:09:56 info: extract_event: Aborting on transient_attributes changes for dceade77-b3bf-40c7-a4b6-cc8995133aa1

Regards,
Andreas

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
