Re: [Pacemaker] cluster-delay property

Michael Schwartzkopff Thu, 24 Oct 2013 07:18:00 -0700

On Thursday, 24 October 2013, 10:06:20, you wrote:
> On 24/10/13 09:01, Michael Schwartzkopff wrote:
> > On Thursday, 24 October 2013, 14:39:39, Karl Rößmann wrote:
> >> Sorry, let me try to explain
> >>
> >> Hi
> >>
> >> In your book you describe a parameter 'deadtime' which defines
> >> the timeout to declare a node as dead. I want to extend this
> >> value to 120s to avoid such a scenario.
> >>
> >> But: in the SuSE documentation I cannot find 'deadtime'; instead
> >> I see a value 'cluster-delay'. My question is: are these two
> >> parameters equivalent?
> >>
> >> More details about the scenario: The I/O load was created by me,
> >> because I copied a large Xen image to a logical volume of the
> >> cLVM (using 'dd'). I did it several times before without
> >> problems. Maybe something changed after upgrading to SLES SP3.
> >>
> >> One node, (it was the DC) died, the Xen resources went to the
> >> surviving node. Fine.
> >>
> >> No information in the log file.
> >>
> >> On the surviving node I see: Oct 23 09:30:41 ha2infra
> >> corosync[9085]:  [TOTEM ] A processor failed, forming new
> >> configuration.
> >
> > (...)
> >
> > the log says that corosync did not see the node. This is not a
> > pacemaker problem.
> >
> > I speculate that this happened because one node was heavily
> > overloaded doing the dd and was too busy to process the corosync
> > tokens in time. Or perhaps the load on the network was so high that
> > corosync packets were dropped.
> >
> > Anyway: This is not a pacemaker problem, it is a corosync problem.
> >
> > If you want to make corosync behave a little bit more relaxed
> > please see "man corosync.conf" for the options. Look for the
> > token option and the options that follow it. I don't know what
> > options are available in SLES11 HAE3. corosync is under heavy
> > improvement ;-)
> >
> > If you have a question for a specific option please ask here on the
> > list.
>
> I agree with Michael that this is a corosync problem. I also agree
> that this is a congestion problem. The variable you are looking for is
> token_retransmit, if I am correct.
>
> I would argue that the better solution is not to adjust this value,
> but to fix your architecture to separate corosync/pacemaker traffic
> from the disk/dd traffic. If you increase token_retransmit, you will
> delay how long real failures take to be detected, thus slowing down
> recovery.


Of course, fiddling around with the token_retransmit option doesn't solve the
problem. It just treats the symptoms.

Perhaps you could limit the transfer rate of dd. Search for "dd rate limit";
there are several solutions. rsync/csync could also be an option.
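
To illustrate what rate-limiting the copy means in practice, here is a
minimal Python sketch, not a tested replacement for dd. The function name
throttled_copy and its parameters are my own invention for this example. It
only paces how fast data is handed to the kernel; page-cache write-back can
still produce bursts, so treat it as a starting point rather than a fix.

```python
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=1024 * 1024,
                   sleep=time.sleep, clock=time.monotonic):
    """Copy from file object src to dst, keeping the average rate
    at or below max_bytes_per_sec. Returns the number of bytes copied.

    sleep and clock are injectable so the pacing can be tested
    without waiting in real time (an assumption of this sketch).
    """
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Earliest point in time at which `copied` bytes may have been
        # written without exceeding the configured average rate.
        earliest = start + copied / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return copied
```

It would be called with two files opened in binary mode (the image as src,
the logical volume as dst) and a limit such as 20 * 1024 * 1024 for roughly
20 MiB/s. The right limit depends on what your storage and network can
sustain alongside the cluster traffic.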

Also you could think about improving your disk I/O sub-system.

But first you should find out where the bottleneck in your system is and how
to remove it.

Kind regards,

Michael Schwartzkopff

--
[*] sys4 AG

http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
Franziskanerstraße 15, 81669 München

Registered office: München, Amtsgericht München: HRB 199263
Executive board: Patrick Ben Koetter, Axel von der Ohe, Marc Schiffbauer
Chairman of the supervisory board: Florian Kirstein


_______________________________________________
Pacemaker mailing list: [email protected]
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
