On Thursday, 24 October 2013, 10:06:20 you wrote:
> On 24/10/13 09:01, Michael Schwartzkopff wrote:
> > On Thursday, 24 October 2013, 14:39:39 Karl Rößmann wrote:
> >> Sorry, let me try to explain
> >>
> >> Hi
> >>
> >> In your book you describe a parameter 'deadtime' which defines
> >> the timeout to declare a node as dead. I want to extend this
> >> value to 120s to avoid such a scenario
> >>
> >> But: in the SuSE documentation I cannot find 'deadtime', instead
> >> I see a value 'cluster-delay'. My question is: Are these two
> >> parameters equivalent?
> >>
> >> More details about the scenario: The I/O load was created by me,
> >> because I copied a large xen image to a logical volume of the
> >> cLVM (using 'dd'). I did it several times before without
> >> problems. Maybe something changed after upgrading to SLES SP3.
> >>
> >> One node (it was the DC) died, the Xen resources went to the
> >> surviving node. Fine.
> >>
> >> No information in the log file.
> >>
> >> On the surviving node I see:
> >> Oct 23 09:30:41 ha2infra corosync[9085]: [TOTEM ] A processor failed, forming new configuration.
> >
> > (...)
> >
> > the log says that corosync did not see the node. This is not a
> > pacemaker problem.
> >
> > I speculate that this happened because one node was heavily
> > overloaded doing the dd and did not find time to process the
> > corosync tokens. Or perhaps the load on the network was so high
> > that corosync packets were dropped.
> >
> > Anyway: This is not a pacemaker problem, it is a corosync problem.
> >
> > If you want to make corosync behave a little bit more relaxed
> > please see "man corosync.conf" for the options. Look for the
> > option token and the options that follow it. I don't know what
> > options are available in SLES11 HAE3. corosync is under heavy
> > improvement ;-)
> >
> > If you have a question for a specific option please ask here on the
> > list.
>
> I agree with Michael that this is a corosync problem. I also agree
> that this is a congestion problem. The variable you are looking for is
> token_retransmit, if I am correct.
>
> I would argue that the better solution is not to adjust this value,
> but to fix your architecture to separate corosync/pacemaker traffic
> from the disk/dd traffic. If you increase token_retransmit, you will
> delay how long real failures take to be detected, thus slowing down
> recovery.

Of course, fiddling around with the token_retransmit option doesn't solve the
problem. It only treats the symptoms.

Perhaps you could limit the transfer rate of dd. Search for "dd rate limit";
there are several solutions. rsync/csync could be one. You could also think
about improving your disk I/O subsystem. But you had better find out what the
bottleneck in your system is and how to remove it.

Kind regards,

Michael Schwartzkopff

-- 
[*] sys4 AG

http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
Franziskanerstraße 15, 81669 München

Registered office: München, Amtsgericht München: HRB 199263
Executive board: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer
Chairman of the supervisory board: Florian Kirstein
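
P.S. To make the "dd rate limit" idea concrete, here is a minimal sketch of a
throttled copy in Python. It is only an illustration, not a tested replacement
for dd: the function name throttled_copy and its parameters are made up for
this example, and the right rate for your cluster is something you would have
to measure yourself. The idea is simply to sleep after each chunk so that the
average throughput never exceeds the chosen limit, which leaves I/O and CPU
headroom for corosync.

```python
import io
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=1024 * 1024,
                   sleep=time.sleep, clock=time.monotonic):
    """Copy src to dst, keeping the average rate at or below the limit.

    src and dst are binary file objects. sleep and clock are injectable
    only so the pacing logic can be checked without waiting in real time.
    """
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Earliest point in time at which `copied` bytes may have been written.
        earliest = start + copied / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return copied


if __name__ == "__main__":
    # Small in-memory demo: 4 MiB at 2 MiB/s takes roughly two seconds.
    source = io.BytesIO(b"\0" * (4 * 1024 * 1024))
    target = io.BytesIO()
    print(throttled_copy(source, target, max_bytes_per_sec=2 * 1024 * 1024))
```

For a real image you would open the source file and the logical volume in
binary mode ("rb" and "wb") instead of using BytesIO, and pick a limit well
below what the storage and network can sustain.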
_______________________________________________
Pacemaker mailing list
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
