For reference, my background is sysadmin work on a variety of flavors of Linux 
& Windows - so if this is a stupidly simple "Solaris thing", a pointer in the 
right direction would be helpful.

I have a 4 node (mltstore0, mltstore1, mltproc0, mltproc1) cluster 
(mltcluster0) in which one of the nodes (mltproc0) won't join after a reboot - 
it sits indefinitely in the 'waiting for quorum' state. Since I'm never 
presented with a login prompt, and the node isn't listening for SSH connections 
from the network, I don't know how to figure out what it is really waiting for.

If I start mltproc0 with the '-x' kernel boot parameter appended in GRUB, then 
it starts up without any problems, but of course not in the cluster.
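
For clarity, this is the kind of change I mean - the stock OpenSolaris kernel$ 
line in menu.lst with '-x' appended (the exact line is from memory, so treat it 
as approximate; the only deliberate change is the trailing '-x'):

kernel$ /platform/i86pc/kernel/$ISADIR/unix -B $ZFS-BOOTFS -x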

I don't know if this is related or not, but the iSCSI initiator keeps spitting 
out 'unable to connect to target' notices. There are two iSCSI targets defined, 
each with two paths, so I get a set of 4 notices at a time. All the nodes get 
these messages early in the boot process, but once each node joins the cluster 
the notices stop. I presume this is because the iSCSI initiator starts talking 
BEFORE clustering starts, and since iSCSI is talking over the cluster 
interconnect paths - which don't exist until cluster startup creates the 
interconnects - it can't reach the targets until later on. Since each node 
boots from a local mirrored zpool as its rpool, I'd prefer to tell iSCSI not to 
bother looking until later in the boot process; that seems like it would both 
remove the errors and speed up booting. But if it's not causing my problem, I'm 
not too worried.
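
For completeness, this is how I was going to confirm exactly which addresses 
the initiator is trying to reach, and how the initiator service is ordered 
relative to everything else (subcommands are as I remember them from the 
iscsiadm man page, and I'm assuming the service FMRI on this build is 
svc:/network/iscsi/initiator:default):

mltproc0# iscsiadm list discovery
mltproc0# iscsiadm list discovery-address
mltproc0# iscsiadm list target -v
mltproc0# svcs -l svc:/network/iscsi/initiator:default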

I'm not sure what diagnostic steps to take, or what information is needed to 
show where I'm doing something wrong. I'm guessing that I did something subtly 
different on mltproc0 from what I did on the other nodes, but I am not sure 
what at this point.
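
The only comparison I've come up with so far is dumping the transport and node 
configuration from a live node and diffing the per-node sections by eye, 
roughly like this (again assuming the standard cl* commands accept node 
operands the way I think they do):

mltproc1# /usr/cluster/bin/clinterconnect show
mltproc1# /usr/cluster/bin/clnode show -v mltproc0 mltproc1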

mltproc1# dladm show-vlan
LINK            VID      OVER         FLAGS
vcmguest0       7        e1000g0      -----   (this is my route to the internet)
vmltmain0       20       e1000g0      -----   (this is the 'to clients' facing vlan)
vmltsysadmin0   21       e1000g0      -----   (this is for my 'admin only' functions - switch configuration, UPSes, SNMP traffic, etc.)
vmltx1          24       rge0         -----
vmltx2          25       e1000g0      -----
The two vlans vmltx1 and vmltx2 are the private interconnect for the cluster. 
All nodes show the same results for dladm.
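
In case the underlying links matter, this is the extra dladm output I can 
gather from mltproc0 while it's booted in non-cluster mode (assuming 
show-phys/show-link are the right subcommands on this build):

mltproc0# dladm show-phys
mltproc0# dladm show-link
mltproc0# dladm show-vlan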

mltproc1:~# /usr/cluster/bin/clnode status

=== Cluster Nodes ===

--- Node Status ---

Node Name                                       Status
---------                                       ------
mltproc0                                        Offline
mltproc1                                        Online
mltstore1                                       Online
mltstore0                                       Online

mltproc1:~# /usr/cluster/bin/clinterconnect status

=== Cluster Transport Paths ===

Endpoint1               Endpoint2               Status
---------               ---------               ------
mltproc1:vmltx2         mltstore1:vmltx2        Path online
mltproc0:vmltx2         mltproc1:vmltx2         faulted
mltproc1:vmltx2         mltstore0:vmltx2        Path online
mltproc0:vmltx1         mltproc1:vmltx1         faulted
mltproc1:vmltx1         mltstore1:vmltx1        Path online
mltproc1:vmltx1         mltstore0:vmltx1        Path online
mltstore1:vmltx2        mltstore0:vmltx2        Path online
mltproc0:vmltx2         mltstore1:vmltx2        faulted
mltstore1:vmltx1        mltstore0:vmltx1        Path online
mltproc0:vmltx1         mltstore1:vmltx1        faulted
mltproc0:vmltx2         mltstore0:vmltx2        faulted
mltproc0:vmltx1         mltstore0:vmltx1        faulted

If I do 'snoop -d vmltx1' on one of the other nodes, I see ARP requests for the 
IP address that 'clinterconnect' shows for mltproc0:vmltx1, but no responses 
back.

If I boot mltproc0 in non-cluster mode and run 'snoop -d vmltx1' there, I see 
the same ARP requests arriving. The picture on vmltx2 is identical: I see the 
same ARP requests whether I snoop on an in-cluster node or on mltproc0.
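
The next thing I was going to try, unless someone tells me it will confuse the 
cluster, is plumbing a scratch address on mltproc0's interconnect VLANs while 
it's booted in non-cluster mode and pinging the address the live nodes have on 
the same links, to rule out a switch/VLAN-tagging problem. Something along 
these lines - 172.16.1.99 is just a placeholder (it would need to be an unused 
address on the same subnet), and <mltproc1-vmltx1-address> is whatever ifconfig 
shows on mltproc1:

mltproc0# ifconfig vmltx1 plumb 172.16.1.99 netmask 255.255.255.0 up
mltproc0# ping <mltproc1-vmltx1-address>
mltproc0# snoop -d vmltx1 arp
mltproc0# ifconfig vmltx1 unplumb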
-- 
This message posted from opensolaris.org