Hi,
I have a 4-node cluster:
srv01, srv02, and srv03 are active, and srv04 is standby.
When srv01's interconnect LANs are unplugged, a split brain occurs.
srv04 then tries to start the resource that is already running on the other
node,
but the start ends with "Timed Out".
After this "Timed Out", I reconnect the LANs.
Heartbeat can recover from the split brain, and the resource is
restarted.
The default value of "multiple_active" is "stop_start", so the resource
is probably stopped and started again after the split brain.
So I set multiple_active="block" as a trial.
The resources are blocked as I expected,
but the other resource in the same group is stopped...
It seems that the group order constraint controls this behavior.
It makes sense.
But we want the resources to keep running (without a restart) after
recovering from the split brain.
Is there a good parameter for this?
Test case
(1) start the resource
Node: srv04 (007c1358-aab5-488e-9b97-a0ad66c5873d): online
Node: srv03 (b4456190-25f1-459f-826c-b9136000857e): online
Node: srv02 (007c1358-aab5-488e-9b97-a0ad66c5873b): online
Node: srv01 (6f4b0dac-ff9c-4941-a186-a9dbab53da96): online
Resource Group: sfex1
prmExPostgreSQLDB1 (ocf::heartbeat:sfex): Started srv01
dummy1 (ocf::heartbeat:Dummy): Started srv01
Resource Group: sfex2
prmExPostgreSQLDB2 (ocf::heartbeat:sfex): Started srv02
dummy2 (ocf::heartbeat:Dummy): Started srv02
Resource Group: sfex3
prmExPostgreSQLDB3 (ocf::heartbeat:sfex): Started srv03
dummy3 (ocf::heartbeat:Dummy): Started srv03
(2) unplug srv01's interconnect LAN
(3) prmExPostgreSQLDB1 timed out on srv04
(4) reconnect srv01's interconnect LAN
(5) prmExPostgreSQLDB2 and prmExPostgreSQLDB3 timed out on srv04
(6) prmExPostgreSQLDB2 and prmExPostgreSQLDB3 are blocked because of
multiple_active="block"
(7) dummy2 and dummy3 are stopped by the group order constraint
Node: srv04 (007c1358-aab5-488e-9b97-a0ad66c5873d): online
Node: srv03 (b4456190-25f1-459f-826c-b9136000857e): online
Node: srv02 (007c1358-aab5-488e-9b97-a0ad66c5873b): online
Node: srv01 (6f4b0dac-ff9c-4941-a186-a9dbab53da96): online
Resource Group: sfex1
prmExPostgreSQLDB1 (ocf::heartbeat:sfex): Started srv01
dummy1 (ocf::heartbeat:Dummy): Started srv01
Resource Group: sfex2
prmExPostgreSQLDB2 (ocf::heartbeat:sfex)[ srv04 srv02 ]
dummy2 (ocf::heartbeat:Dummy): Stopped * can dummy2 be kept running?
Resource Group: sfex3
prmExPostgreSQLDB3 (ocf::heartbeat:sfex)[ srv04 srv03 ]
dummy3 (ocf::heartbeat:Dummy): Stopped * can dummy3 be kept running?
Failed actions:
prmExPostgreSQLDB1_start_0 (node=srv04, call=8, rc=-2): Timed Out
prmExPostgreSQLDB2_start_0 (node=srv04, call=10, rc=-2): Timed Out
prmExPostgreSQLDB3_start_0 (node=srv04, call=11, rc=-2): Timed Out
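For anyone reproducing this, the state after step (7) can be checked
mechanically. The following is only a minimal Python sketch and is not part
of Heartbeat. It assumes the status text uses the line layout shown above;
STATUS and classify are hypothetical names introduced for illustration. It
lists the resources reported as active on more than one node and the
resources reported as Stopped.

```python
import re

# Sample input copied from the status above (assumed line layout).
STATUS = """\
Resource Group: sfex1
prmExPostgreSQLDB1 (ocf::heartbeat:sfex): Started srv01
dummy1 (ocf::heartbeat:Dummy): Started srv01
Resource Group: sfex2
prmExPostgreSQLDB2 (ocf::heartbeat:sfex)[ srv04 srv02 ]
dummy2 (ocf::heartbeat:Dummy): Stopped
Resource Group: sfex3
prmExPostgreSQLDB3 (ocf::heartbeat:sfex)[ srv04 srv03 ]
dummy3 (ocf::heartbeat:Dummy): Stopped
"""

# "name (agent)[ nodeA nodeB ]" : resource reported on several nodes
MULTI = re.compile(r"^(\S+) \(\S+\)\[ (.+?) \]$")
# "name (agent): Started node" or "name (agent): Stopped"
SINGLE = re.compile(r"^(\S+) \(\S+\): (Started \S+|Stopped)")


def classify(text):
    """Return (multi_active, stopped) parsed from the status text."""
    multi_active = {}  # resource name -> list of nodes it is reported on
    stopped = []       # resource names reported as Stopped
    for raw in text.splitlines():
        line = raw.strip()
        m = MULTI.match(line)
        if m:
            multi_active[m.group(1)] = m.group(2).split()
            continue
        m = SINGLE.match(line)
        if m and m.group(2) == "Stopped":
            stopped.append(m.group(1))
    return multi_active, stopped


if __name__ == "__main__":
    multi_active, stopped = classify(STATUS)
    for name, nodes in multi_active.items():
        print(name, "active on:", ", ".join(nodes))
    for name in stopped:
        print(name, "is stopped")
```

With the status above, the sketch reports prmExPostgreSQLDB2 and
prmExPostgreSQLDB3 as multiply active, and dummy2 and dummy3 as stopped.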
If possible, I want to keep the resources running on the same node without
restarting.
Sorry for the complex case...
The log is large, so I opened a bugzilla entry.
http://developerbugs.linux-foundation.org//show_bug.cgi?id=2004
Best Regards,
Junko Ikeda
NTT DATA INTELLILINK CORPORATION