For reference, my background is sysadmin work on a variety of flavors of Linux & Windows - so if this is a stupidly simple "Solaris thing", a pointer in the right direction would be helpful.
I have a 4-node (mltstore0, mltstore1, mltproc0, mltproc1) cluster (mltcluster0) in which one of the nodes (mltproc0) won't join after a reboot - it gets stuck indefinitely in the 'waiting for quorum' state. Since I'm not presented with a login prompt, and the node isn't listening for SSH connections from the network, I don't know how to figure out what it is really waiting for. If I start mltproc0 with the '-x' kernel boot parameter appended in GRUB, it starts up without any problems, but of course not in the cluster.

I don't know if this is related or not, but the iSCSI initiator keeps spitting out 'unable to connect to target' notices. There are two iSCSI targets defined, each with two paths, so I get a set of 4 iSCSI notices at a time. I have noticed that all the nodes get these messages early in the boot process, but once each node joins the cluster the notices stop. I presume this is because iSCSI starts talking BEFORE clustering starts, and since iSCSI is talking over the cluster interconnect paths, which don't exist until clustering startup creates the interconnects, iSCSI can't reach the targets until later on. Since each node has a local mirrored zpool as its rpool to boot from, I'd prefer to tell iSCSI not to bother looking until later in the process - that would both remove the errors and speed up boot - but if it's not causing my problem I'm not too worried about it.

I'm not sure what diagnostic steps to take, or what information is needed to show where I'm doing something wrong. I'm guessing that I did something subtly different on mltproc0 from what I did on the other nodes, but I am not sure what at this point.

mltproc1# dladm show-vlan
LINK            VID    OVER       FLAGS
vcmguest0       7      e1000g0    -----    (this is my route to the internet)
vmltmain0       20     e1000g0    -----    (this is the 'to clients' facing vlan)
vmltsysadmin0   21     e1000g0    -----    (this is for my 'admin only' functions - switch configuration, UPSes, SNMP traffic, etc.)
vmltx1          24     rge0       -----
vmltx2          25     e1000g0    -----

The two vlans vmltx1 and vmltx2 are the private interconnect for the cluster. All nodes show the same results for dladm.

mltproc1:~# /usr/cluster/bin/clnode status

=== Cluster Nodes ===

--- Node Status ---

Node Name       Status
---------       ------
mltproc0        Offline
mltproc1        Online
mltstore1       Online
mltstore0       Online

mltproc1:~# /usr/cluster/bin/clinterconnect status

=== Cluster Transport Paths ===

Endpoint1             Endpoint2             Status
---------             ---------             ------
mltproc1:vmltx2       mltstore1:vmltx2      Path online
mltproc0:vmltx2       mltproc1:vmltx2       faulted
mltproc1:vmltx2       mltstore0:vmltx2      Path online
mltproc0:vmltx1       mltproc1:vmltx1       faulted
mltproc1:vmltx1       mltstore1:vmltx1      Path online
mltproc1:vmltx1       mltstore0:vmltx1      Path online
mltstore1:vmltx2      mltstore0:vmltx2      Path online
mltproc0:vmltx2       mltstore1:vmltx2      faulted
mltstore1:vmltx1      mltstore0:vmltx1      Path online
mltproc0:vmltx1       mltstore1:vmltx1      faulted
mltproc0:vmltx2       mltstore0:vmltx2      faulted
mltproc0:vmltx1       mltstore0:vmltx1      faulted

If I do 'snoop -d vmltx1' on one of the other nodes, I see ARP requests for the IP address that 'clinterconnect' shows for mltproc0:vmltx1, but no responses back. If I boot mltproc0 in non-cluster mode and run 'snoop -d vmltx1' there, I see the same ARP requests. On vmltx2, I see the same ARP requests whether I look from an in-cluster node or from mltproc0.
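For what it's worth, here is what I was planning to check next; please tell me if I'm heading in the wrong direction. First, the quorum configuration as the rest of the cluster sees it, since mltproc0 claims to be waiting for quorum. I'm assuming the Solaris Cluster 3.2-style clquorum command is the right tool here:

mltproc1:~# /usr/cluster/bin/clquorum status    (node and quorum device vote counts as seen by an online node)
mltproc1:~# /usr/cluster/bin/clquorum show -v   (full quorum device and node vote configuration)

I don't actually know yet whether my quorum device lives on the iSCSI storage or not, which is part of what I want to confirm.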
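Next, comparing what the cluster thinks mltproc0's transport adapters and cables are against the other nodes, and checking link state on the physical NICs underneath the two VLANs while mltproc0 is booted with '-x'. Again, these are just the commands I believe apply; the prompts show where I'd run each one:

mltproc1:~# /usr/cluster/bin/clinterconnect show   (registered transport adapters, cables and switches per node)
mltproc0:~# dladm show-phys                        (link state and speed of rge0 and e1000g0)
mltproc0:~# dladm show-link                        (confirm vmltx1 and vmltx2 are up over the NICs I expect)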
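To narrow down the ARP behaviour, I figured I could restrict snoop to ARP traffic on both ends at once - one window on an online node and one on mltproc0 booted with '-x' - to see whether mltproc0 ever sees or answers the requests on each VLAN, and then repeat the same thing on vmltx2:

mltproc1:~# snoop -r -d vmltx1 arp   (requests for mltproc0's interconnect address, and any replies)
mltproc0:~# snoop -r -d vmltx1 arp   (do the same requests arrive on mltproc0's side of VLAN 24?)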
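I also thought I'd look for anything the cluster framework logged on mltproc0 during the failed boot attempts, since I can get at the disk fine after a '-x' boot. I'm assuming the transport and quorum messages land in /var/adm/messages on this node the same way they do on the others:

mltproc0:~# svcs -xv                                                        (any SMF services left in maintenance after the -x boot)
mltproc0:~# egrep -i 'clcomm|transport|quorum' /var/adm/messages | tail -50  (interconnect/quorum messages from the earlier boot attempts)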
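And finally, for the iSCSI noise (which may be a complete red herring), I wanted to see exactly how discovery is configured on each node before I try to change anything - I don't remember offhand whether I used sendtargets or static configuration there:

mltproc0:~# iscsiadm list discovery           (which discovery methods - static / sendtargets / iSNS - are enabled)
mltproc0:~# iscsiadm list discovery-address   (the addresses sendtargets discovery points at)
mltproc0:~# iscsiadm list target -v           (configured targets and their current connection state)

If it turns out the targets really are only reachable over the interconnect VLANs, I'd still appreciate a pointer on how to make the initiator wait until clustering is up, since I don't see an obvious knob for that. Any other suggestions on what to gather would be welcome.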