Hi,

Yesterday we encountered interesting behaviour of SC 3.2 with a quorum server, and somehow I feel that the way the cluster behaves is wrong, but it could be me who is wrong.
We have a two-node cluster with TrueCopy-replicated storage and a quorum server at a remote location, with no other quorum devices. We were testing different fault scenarios (disk failure, SAN fibre failures, public interface failures, interconnect failures), and the problem that bothers me is related to the last one: cluster interconnect failure. The public interfaces were connected and the quorum server was up, running and perfectly accessible; however, as soon as we disconnected both cluster interconnect cables, the active node with all resource groups on it crashed with a kernel panic and all resource groups failed over to the standby node. I see the following messages in the logs:

Jul 10 11:36:11 isksdbnp01 cl_runtime: [ID 489438 kern.notice] NOTICE: clcomm: Path isksdbnp01:e1000g0 - ldrsdbnp01:e1000g0 being drained
Jul 10 11:37:11 isksdbnp01 cl_runtime: [ID 604153 kern.notice] NOTICE: clcomm: Path isksdbnp01:e1000g0 - ldrsdbnp01:e1000g0 errors during initiation
Jul 10 11:37:11 isksdbnp01 cl_runtime: [ID 618107 kern.warning] WARNING: Path isksdbnp01:e1000g0 - ldrsdbnp01:e1000g0 initiation encountered errors, errno = 62. Remote node may be down or unreachable through this path.
Jul 10 11:37:21 isksdbnp01 sg: [ID 266374 kern.notice] Symantec SCSA Generic Revision: 3.6
Jul 10 11:41:27 isksdbnp01 ip: [ID 678092 kern.notice] TCP_IOC_ABORT_CONN: local = 000.000.000.000:0, remote = 010.011.123.018:0, start = -2, end = 6
Jul 10 11:41:27 isksdbnp01 cl_runtime: [ID 266834 kern.warning] WARNING: CMM: Our partition has been preempted.
Jul 10 11:41:29 isksdbnp01 cl_dlpitrans: [ID 624622 kern.notice] Notifying cluster that this node is panicking
Jul 10 11:41:29 isksdbnp01 unix: [ID 836849 kern.notice]
Jul 10 11:41:29 isksdbnp01 ^Mpanic[cpu40]/thread=3000c3970e0:
Jul 10 11:41:29 isksdbnp01 unix: [ID 265925 kern.notice] CMM: Cluster lost operational quorum; aborting.
Jul 10 11:41:29 isksdbnp01 unix: [ID 100000 kern.notice]
Jul 10 11:41:29 isksdbnp01 genunix: [ID 723222 kern.notice] 000002a1017f3540 cl_runtime:__1cZsc_syslog_msg_log_no_args6Fpviipkc0_nZsc_syslog_msg_status_enum_ _+30 (60046fd0800, 3, 0, 43, 2a1017f3740, 705ccb67)
Jul 10 11:41:29 isksdbnp01 genunix: [ID 179002 kern.notice] %l0-3: 00000000705cc6d0 000000000000004c 000006003cd3dee6 000000000000004c
Jul 10 11:41:29 isksdbnp01 %l4-7: 000000001092366f 000006003cd33285 0000000000000000 00000000701c3000

So it's obvious from the logs that the cluster thought it had no majority of votes and shot itself in the head, which is absolutely normal behaviour if it has only one vote. But as far as I understand, if the quorum server was accessible (it was) and we just lost the cluster interconnect, the active node should still have two votes out of three and should continue operating in standalone mode (I've spelled out my vote arithmetic in a small sketch at the end of this post). I can play with this all day and always get exactly the same result, and we have a second cluster on the second domain of the same M9000 boxes which behaves exactly the same.

I'd assume that in a situation where both nodes have access to the quorum server but cannot communicate with each other, the currently active and running node should have some kind of higher priority, and there's absolutely no reason why that node should panic and can't continue operation. Is this another case of "works as designed, but probably not as desired" (c) Sun Support, or am I missing something?

Thanks in advance!

Sergei
--
This message posted from opensolaris.org
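P.S. Just to spell out the vote arithmetic I'm assuming, here is a minimal, purely illustrative sketch in Python. This is not the actual CMM code and the names and structure are my own; it only models the "three votes, two needed" reasoning above.

# Minimal sketch of the quorum arithmetic as I understand it.
# Purely illustrative -- not Sun Cluster code; names are my own.
node_votes = {"isksdbnp01": 1, "ldrsdbnp01": 1}   # our two cluster nodes, one vote each
quorum_server_votes = 1                           # single quorum server, no other quorum devices

total_votes = sum(node_votes.values()) + quorum_server_votes   # 3
needed_for_quorum = total_votes // 2 + 1                       # 2 = operational quorum

# After the interconnect split, I would expect the partition made of the
# active node plus the quorum server vote it can still reach to look like:
active_partition_votes = node_votes["isksdbnp01"] + quorum_server_votes   # 2

print(f"total={total_votes}, needed={needed_for_quorum}, "
      f"active partition={active_partition_votes}")
# prints: total=3, needed=2, active partition=2

With that arithmetic the active partition sits exactly at the two-vote threshold, which is why the "Cluster lost operational quorum" panic surprises me.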