On Tue, 17 Dec 2013 09:28:51 +0100 Michael Schwartzkopff <m...@sys4.de> wrote:
> On Tuesday, 17 December 2013, 09:17:31, ma...@nucleus.it wrote:
> > Hi all,
> > I set up a 2-node cluster with a crossover cable between the two
> > nodes, without STONITH. I know this is not the best way, but this is
> > the scenario I needed at the time.
> >
> > I know the releases are old:
> > corosync-1.2.7-1.2
> > libcorosync-1.2.7-1.2
> > pacemaker-1.0.10-1.4
> > libpacemaker3-1.0.10-1.4
> >
> > Everything was OK for some days/months, but a few days ago something
> > went wrong between the two nodes, without any network interruption
> > (no messages from the ethernet modules, no errors in the network
> > statistics, and no notifications from the nagios ping checks).
> >
> > From what I can understand from the attached logs:
> > Token Timeout (10000 ms) retransmit timeout (980 ms)
> > token hold (774 ms) retransmits before loss (10 retrans)
> >
> > the 2 nodes lost a token and tried to resolve the situation, but
> > node1 thinks node2 is up:
> >
> > Dec 7 05:01:41 node1 pengine: [1138]: info: determine_online_status: Node node2 is online
> > Dec 7 05:01:41 node1 pengine: [1138]: info: determine_online_status: Node node1 is online
> >
> > and then lost:
> >
> > Dec 7 05:01:54 node1 corosync[1128]: [pcmk ] info: ais_mark_unseen_peer_dead: Node node2 was not seen in the previous transition
> > Dec 7 05:01:54 node1 corosync[1128]: [pcmk ] info: update_member: Node 33559980/node2 is now: lost
> >
> > while node2 thinks node1 was gone:
> >
> > Dec 7 05:01:34 node2 corosync[6356]: [pcmk ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
> > Dec 7 05:01:34 node2 corosync[6356]: [pcmk ] info: update_member: Node 16782764/node1 is now: lost
> >
> > Then they went into split brain.
> > Any suggestion as to why node1 still saw node2 at first, while
> > node2 immediately declared node1 lost?
>
> This depends on who initiates the round. Both nodes recognized the
> failure within 20 seconds. This is OK, especially if you allow 10
> seconds for a token timeout.
>
> Kind regards,
>
> Michael Schwartzkopff

OK, that is fine, but it is very strange that, without any network loss
between the nodes, they could not resend the token and later re-establish
quorum :( .

Marco

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
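
For reference, the timer values quoted from the log (980 ms retransmit, 774 ms hold) look like values derived from the 10000 ms token timeout rather than values set by hand. The sketch below reproduces them. The two formulas are an assumption, recalled from how corosync 1.x appears to fill in unset totem timers, and HZ = 100 is an assumed kernel tick rate. Neither comes from this thread, and the function name is hypothetical.

```python
# Sketch: reproduce the derived totem timers quoted in the log.
# ASSUMPTION: these formulas mirror how corosync 1.x fills in unset timers;
# they are recalled from its source, not stated anywhere in this thread.
# ASSUMPTION: HZ = 100 (kernel tick rate) is a guess that fits the logged values.

TOKEN_TIMEOUT_MS = 10000       # "Token Timeout (10000 ms)" from the log
RETRANSMITS_BEFORE_LOSS = 10   # "retransmits before loss (10 retrans)" from the log
HZ = 100                       # assumed


def derive_totem_timers(token_ms, retransmits, hz=HZ):
    """Return (retransmit_timeout_ms, token_hold_ms) under the assumed formulas."""
    # Spread the retransmits across the token timeout, with a small margin.
    retransmit_ms = int(token_ms / (retransmits + 0.2))
    # Hold the token for 80% of the retransmit interval, minus one timer tick.
    hold_ms = int(retransmit_ms * 0.8 - (1000 // hz))
    return retransmit_ms, hold_ms


retransmit_ms, hold_ms = derive_totem_timers(TOKEN_TIMEOUT_MS, RETRANSMITS_BEFORE_LOSS)
print(f"retransmit timeout: {retransmit_ms} ms, token hold: {hold_ms} ms")
```

If these assumptions hold, only the token timeout and the retransmit count need to be checked in the configuration on both nodes; the other two values follow from them.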