On 9/13/07, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> On 8/30/07, Andreas Kurz <[EMAIL PROTECTED]> wrote:
> > On 8/24/07, Andrew Beekhof <[EMAIL PROTECTED]> wrote:
> > > On 8/21/07, Andreas Kurz <[EMAIL PROTECTED]> wrote:
> > > > Hello all,
> > > >
> > > > I have a Heartbeat 2.1.2 two-node cluster installation using three
> > > > ping nodes to check network connectivity. Today one of the ping nodes
> > > > was restarted and the loss of one ping node was detected by both nodes
> > > > at the same time (according to the logs).
> > > >
> > > > The problem was that the pingd score_attribute was only decreased for
> > > > one node (holding two resource groups) in a first attempt and
> > > > heartbeat began to stop resources to migrate the groups away. During
> > > > the resource shutdown the pingd score_attribute of the second node was
> > > > also decreased and the resource migration was stopped and restarted on
> > > > the current node. Some seconds later the third ping node was up again,
> > > > the pingd score_attribute was updated for both nodes and the
> > > > resources were untouched.
> > > >
> > > > My question is: Is there a way to 'tune' the configuration to avoid
> > > > such resource restarts and why was the pingd score_attribute updated
> > > > 'simultaneously' when the ping node came up again but not when it went
> > > > down?
> > >
> > > if you're using the RA, increase the value of "dampen"
> > >
> > > if an event happens, we'll wait 'dampen' seconds (or milliseconds) to
> > > see if another one occurs on another node - so that we can update the
> > > CIB with both of them at the same time.
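
To make the effect of "dampen" concrete, here is a toy model in Python. It is only a sketch of the idea described above, not how pingd or attrd is actually implemented: the class DampenedWriter, the function simulate, and the node names are all made up for illustration, and a single shared collector is assumed. The one-second gap between the two reports mirrors the timestamps in the log excerpt further down.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DampenedWriter:
    """Hypothetical collector: batches attribute changes for `dampen` seconds."""
    dampen: float
    pending: Dict[str, int] = field(default_factory=dict)
    deadline: Optional[float] = None
    commits: List[Dict[str, int]] = field(default_factory=list)

    def tick(self, now: float) -> None:
        # Flush the batch as one commit once the dampen window has expired.
        if self.deadline is not None and now >= self.deadline:
            self.commits.append(dict(self.pending))
            self.pending.clear()
            self.deadline = None

    def report(self, now: float, node: str, value: int) -> None:
        # Flush anything already due, then add this change to the open batch.
        self.tick(now)
        if self.deadline is None:
            # The first change in a batch starts the dampen timer.
            self.deadline = now + self.dampen
        self.pending[node] = value


def simulate(dampen: float) -> List[Dict[str, int]]:
    """Two nodes report pingd=2000 one second apart (values are illustrative)."""
    writer = DampenedWriter(dampen=dampen)
    writer.report(0.0, "node1", 2000)
    writer.report(1.0, "node2", 2000)
    writer.tick(60.0)  # let any open window expire
    return writer.commits


print(simulate(5.0))  # window longer than the gap between the reports
print(simulate(0.5))  # window shorter than the gap between the reports
```

In this model a window longer than the gap between the two reports yields a single combined commit, while a shorter window yields two separate commits, which is the situation in which the scores are briefly unequal.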
> > >
> > According to my logs the pingds on both nodes recognized the loss of
> > one ping-node at the same time but immediately after the cib was
> > updated once with the first pingd-attribute heartbeat started to stop
> > the resources, then the second pingd-attribute update happens a second
> > later and the already stopped resources were started again on the same
> > host .... was this some sort of a race-condition? Should heartbeat
> > maybe wait one additional ping interval for pingd-attribute updates
> > before starting the recalculation of the scores in case one node is a
> > little bit late when sending the updates .... or does this make no
> > sense?
>
> in the current design of attrd, there is a small chance that the node
> that triggers the update will get in a little too quickly and the
> updates don't show up close enough together.  basically that's what
> happened here.
>
> the real solution is to have all peers supply their changes to one
> node that does the update (hence ensuring the updates are truly
> atomic)
>
> we know what we need to do, it's just a matter of getting the time to do it...
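
The difference between two separate updates and one atomic update can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the real policy engine: the functions preferred_node and replay and the node names are hypothetical, placement is assumed to follow the highest pingd score only, ties are assumed to favor the node currently holding the resources, and a move is assumed not to have completed before the next update arrives. The starting score of 3000 is an assumption; only pingd=2000 appears in the logs.

```python
from typing import Dict, List


def preferred_node(scores: Dict[str, int], home: str) -> str:
    """Pick the highest-scoring node; on a tie, stay on the home node."""
    best = max(scores.values())
    if scores[home] == best:
        return home
    return max(scores, key=scores.get)


def replay(commits: List[Dict[str, int]], start: Dict[str, int], home: str) -> List[str]:
    """Apply CIB commits in order and record the placement decision after each."""
    scores = dict(start)
    decisions = []
    for commit in commits:
        scores.update(commit)
        # Each commit triggers a recalculation against the current scores.
        decisions.append(preferred_node(scores, home))
    return decisions


start = {"node1": 3000, "node2": 3000}  # assumed scores with all ping nodes up

# Two updates a second apart: node1 briefly looks worse than node2.
split = replay([{"node1": 2000}, {"node2": 2000}], start, home="node1")

# One atomic update carrying both changes: the scores never diverge.
atomic = replay([{"node1": 2000, "node2": 2000}], start, home="node1")

print(split)   # ['node2', 'node1']
print(atomic)  # ['node1']
```

With split updates the first recalculation decides to move the resources away and the second one reverses that decision, which corresponds to the stop followed by a restart on the same host. With a single atomic update there is only one recalculation and nothing moves.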
OK ... I see ... Thanks for your reply and the information Andrew.

Regards,
Andreas

>
> > ..
> > attrd[7171]: 2007/08/21_10:09:55 info: attrd_perform_update: Sent
> > update 16: pingd=2000
> > tengine[932]: 2007/08/21_10:09:55 info: extract_event: Aborting on
> > transient_attributes changes for 738e0605-7e82-47b8-b21a-e69b733eb98b
> > ...
> > tengine[932]: 2007/08/21_10:09:56 info: extract_event: Aborting on
> > transient_attributes changes for dceade77-b3bf-40c7-a4b6-cc8995133aa1
> >
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
