Hi,

Yesterday we encountered some interesting behaviour of SC 3.2 with a quorum 
server, and I feel that the way the cluster behaves is wrong, but it could be 
me who is wrong.

We have a two-node cluster with TrueCopy-replicated storage and a quorum server 
at a remote location, with no other quorum devices. We were testing different 
fault scenarios, i.e. disk failure, SAN fibre failures, public interface 
failures, and interconnect failures, and the problem that bothers me is related 
to the last one, cluster interconnect failure. The public interfaces were 
connected and the quorum server was up, running, and perfectly accessible; 
however, as soon as we disconnected both cluster interconnect cables, the 
active node with all the resource groups on it crashed with a kernel panic and 
all resource groups failed over to the standby node.

I see the following messages in the logs:

Jul 10 11:36:11 isksdbnp01 cl_runtime: [ID 489438 kern.notice] NOTICE: clcomm: 
Path isksdbnp01:e1000g0 - ldrsdbnp01:e1000g0 being drained
Jul 10 11:37:11 isksdbnp01 cl_runtime: [ID 604153 kern.notice] NOTICE: clcomm: 
Path isksdbnp01:e1000g0 - ldrsdbnp01:e1000g0 errors during initiation
Jul 10 11:37:11 isksdbnp01 cl_runtime: [ID 618107 kern.warning] WARNING: Path 
isksdbnp01:e1000g0 - ldrsdbnp01:e1000g0 initiation encountered errors, errno =
62. Remote node may be down or unreachable through this path.
Jul 10 11:37:21 isksdbnp01 sg: [ID 266374 kern.notice] Symantec SCSA Generic 
Revision: 3.6
Jul 10 11:41:27 isksdbnp01 ip: [ID 678092 kern.notice] TCP_IOC_ABORT_CONN: 
local = 000.000.000.000:0, remote = 010.011.123.018:0, start = -2, end = 6
Jul 10 11:41:27 isksdbnp01 cl_runtime: [ID 266834 kern.warning] WARNING: CMM: 
Our partition has been preempted.
Jul 10 11:41:29 isksdbnp01 cl_dlpitrans: [ID 624622 kern.notice] Notifying 
cluster that this node is panicking
Jul 10 11:41:29 isksdbnp01 unix: [ID 836849 kern.notice]
Jul 10 11:41:29 isksdbnp01 ^Mpanic[cpu40]/thread=3000c3970e0:
Jul 10 11:41:29 isksdbnp01 unix: [ID 265925 kern.notice] CMM: Cluster lost 
operational quorum; aborting.
Jul 10 11:41:29 isksdbnp01 unix: [ID 100000 kern.notice]
Jul 10 11:41:29 isksdbnp01 genunix: [ID 723222 kern.notice] 000002a1017f3540 
cl_runtime:__1cZsc_syslog_msg_log_no_args6Fpviipkc0_nZsc_syslog_msg_status_enum_
_+30 (60046fd0800, 3, 0, 43, 2a1017f3740, 705ccb67)
Jul 10 11:41:29 isksdbnp01 genunix: [ID 179002 kern.notice]   %l0-3: 
00000000705cc6d0 000000000000004c 000006003cd3dee6 000000000000004c
Jul 10 11:41:29 isksdbnp01   %l4-7: 000000001092366f 000006003cd33285 
0000000000000000 00000000701c3000

So it's obvious from the logs that the cluster thought it had no majority of 
votes and shot itself in the head, which is absolutely normal behaviour if it 
only has 1 vote. But as far as I understand, if the quorum server was 
accessible (it was) and we just lost the cluster interconnect, the active node 
should still have 2 votes out of 3 and should continue operating in standalone 
mode. I can play with this all day and always get exactly the same results, 
and we have a second cluster on a second domain of the same M9000 boxes which 
behaves exactly the same.

I'd assume that in a situation where both nodes have access to the quorum 
server but cannot communicate with each other, the currently active and running 
node should have some kind of higher priority, and there's absolutely no reason 
why that node should panic and can't continue operation. Is this another case 
of "works as designed, but probably not as desired" (c) Sun Support, or am I 
missing something?
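
To illustrate how I picture the arbitration (again just a toy model with 
invented names, not the real CMM/quorum-server protocol): after the 
interconnect drops, each one-node partition tries to claim the quorum server's 
vote, the winner keeps a majority and survives, and the loser is left with 1 of 
3 votes and aborts, which is what the "Our partition has been preempted" 
message above seems to show for the active node.

# Toy model of the split-brain race (my own illustration, not the real
# CMM/quorum-server protocol).

TOTAL_VOTES = 3                       # node1 + node2 + quorum server
MAJORITY = TOTAL_VOTES // 2 + 1

def partition_outcome(own_votes: int, won_quorum_server: bool) -> str:
    votes = own_votes + (1 if won_quorum_server else 0)
    return "survives" if votes >= MAJORITY else "preempted -> panic"

# Hypothetical outcome if the standby node gets to the quorum server
# first, which is how I read the messages above:
print("active  node:", partition_outcome(1, won_quorum_server=False))
print("standby node:", partition_outcome(1, won_quorum_server=True))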

Thanks in advance!
Sergei
-- 
This message posted from opensolaris.org