[Linux-HA] resource restart after recovering split brain

Junko IKEDA Thu, 27 Nov 2008 00:08:26 -0800

Hi,

I have 4 nodes:
srv01, srv02, and srv03 are active, and srv04 is standby.


When srv01's interconnect LANs are unplugged, split brain occurs.
srv04 then tries to start the resource, which is already running on another
node,
but the start ends with "Timed Out".
After this "Timed Out", I reconnect the LANs.
Heartbeat handles the recovery from split brain, and the resource is
restarted.
The default value of "multiple_active" is "stop_start", so the resource
is stopped and started again after split brain.
So I set multiple_active="block" as a trial.
The resource is blocked as I had expected,
but the other resource in the same group is stopped...
It seems that the group order constraint causes this behavior,
which makes sense.
But we want the resources to keep running (without a restart) after
recovering from split brain.
Is there a parameter that allows this?
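
To make the behavior I describe above concrete, here is a toy model in
Python. It is only a sketch of what I observed, not Heartbeat's actual
logic. The function name recover_group and the state labels are made up for
illustration, and the assumption that later group members are also restarted
under "stop_start" is mine.

```python
# Toy model of the recovery behavior described in this mail.
# This is NOT Heartbeat code; names and state labels are hypothetical.

def recover_group(members, multiply_active, policy):
    """Return {resource: state} for an ordered group after split brain.

    members         -- resource names in group order
    multiply_active -- set of resources found running on more than one node
    policy          -- "stop_start" or "block" (multiple_active values)
    """
    if policy not in ("stop_start", "block"):
        raise ValueError("unsupported policy: %r" % policy)

    states = {}
    earlier_state = None  # set once an earlier member is blocked or restarted
    for name in members:
        if earlier_state == "blocked":
            # Observed: members after a blocked resource are stopped.
            states[name] = "stopped"
        elif earlier_state == "restarted":
            # Assumption: members ordered after a restarted one restart too.
            states[name] = "restarted"
        elif name in multiply_active:
            states[name] = "blocked" if policy == "block" else "restarted"
            earlier_state = states[name]
        else:
            states[name] = "running"
    return states


if __name__ == "__main__":
    print(recover_group(["prmExPostgreSQLDB2", "dummy2"],
                        {"prmExPostgreSQLDB2"}, "block"))
```

What I am looking for is a setting under which both members of the group
would end up as "running" in this model.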

Test case
(1) start the resource

Node: srv04 (007c1358-aab5-488e-9b97-a0ad66c5873d): online
Node: srv03 (b4456190-25f1-459f-826c-b9136000857e): online
Node: srv02 (007c1358-aab5-488e-9b97-a0ad66c5873b): online
Node: srv01 (6f4b0dac-ff9c-4941-a186-a9dbab53da96): online

Resource Group: sfex1
    prmExPostgreSQLDB1  (ocf::heartbeat:sfex):  Started srv01
    dummy1      (ocf::heartbeat:Dummy): Started srv01
Resource Group: sfex2
    prmExPostgreSQLDB2  (ocf::heartbeat:sfex):  Started srv02
    dummy2      (ocf::heartbeat:Dummy): Started srv02
Resource Group: sfex3
    prmExPostgreSQLDB3  (ocf::heartbeat:sfex):  Started srv03
    dummy3      (ocf::heartbeat:Dummy): Started srv03

(2) unplug srv01's interconnect LAN
(3) prmExPostgreSQLDB1 times out on srv04
(4) reconnect srv01's interconnect LAN
(5) prmExPostgreSQLDB2 and prmExPostgreSQLDB3 time out on srv04
(6) prmExPostgreSQLDB2 and prmExPostgreSQLDB3 are blocked because of
multiple_active="block"
(7) dummy2 and dummy3 are stopped by the group order constraint

Node: srv04 (007c1358-aab5-488e-9b97-a0ad66c5873d): online
Node: srv03 (b4456190-25f1-459f-826c-b9136000857e): online
Node: srv02 (007c1358-aab5-488e-9b97-a0ad66c5873b): online
Node: srv01 (6f4b0dac-ff9c-4941-a186-a9dbab53da96): online

Resource Group: sfex1
    prmExPostgreSQLDB1  (ocf::heartbeat:sfex):  Started srv01
    dummy1      (ocf::heartbeat:Dummy): Started srv01
Resource Group: sfex2
    prmExPostgreSQLDB2  (ocf::heartbeat:sfex)[  srv04   srv02 ]
    dummy2      (ocf::heartbeat:Dummy): Stopped * can dummy2 be kept running?
Resource Group: sfex3
    prmExPostgreSQLDB3  (ocf::heartbeat:sfex)[  srv04   srv03 ]
    dummy3      (ocf::heartbeat:Dummy): Stopped * can dummy3 be kept running?

Failed actions:
    prmExPostgreSQLDB1_start_0 (node=srv04, call=8, rc=-2): Timed Out
    prmExPostgreSQLDB2_start_0 (node=srv04, call=10, rc=-2): Timed Out
    prmExPostgreSQLDB3_start_0 (node=srv04, call=11, rc=-2): Timed Out
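
In the status output above, a resource that is active on more than one node
is shown with a bracketed node list instead of "Started <node>". A small
Python sketch for picking those lines out of saved output follows. It assumes
exactly the layout pasted in this mail (other versions may print it
differently), and the function name find_multiply_active is made up for
illustration.

```python
import re

# Matches lines in the layout shown in this mail, for example:
#   prmExPostgreSQLDB2  (ocf::heartbeat:sfex)[  srv04   srv02 ]
# Groups: resource name, agent spec, whitespace-separated node list.
MULTI_ACTIVE = re.compile(r"^\s*(\S+)\s+\(([^)]+)\)\s*\[\s*([^\]]*?)\s*\]")


def find_multiply_active(status_text):
    """Return {resource: [nodes]} for resources listed on more than one node."""
    found = {}
    for line in status_text.splitlines():
        match = MULTI_ACTIVE.match(line)
        if match is None:
            continue
        nodes = match.group(3).split()
        if len(nodes) > 1:
            found[match.group(1)] = nodes
    return found


if __name__ == "__main__":
    sample = (
        "Resource Group: sfex2\n"
        "    prmExPostgreSQLDB2  (ocf::heartbeat:sfex)[  srv04   srv02 ]\n"
        "    dummy2      (ocf::heartbeat:Dummy): Stopped\n"
    )
    print(find_multiply_active(sample))
```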

If possible, I want to keep the resources running on the same node without
restarting them.
Sorry for the complex case...

The logs are large, so I opened a Bugzilla entry:
http://developerbugs.linux-foundation.org//show_bug.cgi?id=2004

Best Regards,
Junko Ikeda

NTT DATA INTELLILINK CORPORATION


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
