Hi, I have the OpenSolaris 2009.06 HA cluster installed on both hosts and have added the quorum server to the cluster. Both hosts are online, and now I am trying to set up SVM. I created two disksets, with 15 drives in each, as shown below:
root@xCid:# metaset

Set name = xCid_diskSet, Set number = 1

Host                Owner
  xCid               Yes
  xCloud

Drive   Dbase
  d1     Yes
  d2     Yes
  d3     Yes
  d4     Yes
  d5     Yes
  d6     Yes
  d7     Yes
  d8     Yes
  d9     Yes
  d10    Yes
  d11    Yes
  d12    Yes
  d13    Yes
  d14    Yes
  d15    Yes
  d16    Yes

Set name = xCloud_diskSet, Set number = 2

Host                Owner
  xCid               Yes
  xCloud

Drive   Dbase
  d17    Yes
  d18    Yes
  d19    Yes
  d20    Yes
  d21    Yes
  d22    Yes
  d23    Yes
  d24    Yes
  d25    Yes
  d26    Yes
  d27    Yes
  d28    Yes
  d29    Yes
  d30    Yes
  d31    Yes

root@xCid:# metaset | grep Set
Set name = xCid_diskSet, Set number = 1
Set name = xCloud_diskSet, Set number = 2
root@xCid:#

For some reason I am not able to check the metadevices in the md.tab file by running the "metainit" command; the error that I get is "device not in set":

root@xCid:# metainit -n -a -s xCid_diskSet
metainit: xCid: /etc/lvm/md.tab line 81: d1s0: device not in set
metainit: xCid: /etc/lvm/md.tab line 82: d2s0: device not in set
metainit: xCid: /etc/lvm/md.tab line 83: d3s0: device not in set
metainit: xCid: /etc/lvm/md.tab line 84: d4s0: device not in set
metainit: xCid: /etc/lvm/md.tab line 85: d5s0: device not in set
metainit: xCid: /etc/lvm/md.tab line 86: d6s0: device not in set
metainit: xCid: /etc/lvm/md.tab line 87: d7s0: device not in set

I have attached the md.tab file to this email. Can you please check whether the format of the entries in the md.tab file is correct, and why I get the error message above?

Thanks a lot for your time.

Janey
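For comparison, a minimal sketch of what md.tab entries for a named diskset typically look like, assuming the components are DID slices of drives that belong to the set (the set name and the DID names d1/d2 come from the metaset output above; the metadevice numbers d100/d101/d102 and the one-way mirror layout are made up purely for illustration and are not taken from the attached md.tab):

    # /etc/lvm/md.tab -- illustrative fragment only
    # Metadevices that live in a named set are written as setname/dN,
    # and in this sketch the components are given as full DID device paths.
    xCid_diskSet/d100 -m xCid_diskSet/d101
    xCid_diskSet/d101 1 1 /dev/did/rdsk/d1s0
    xCid_diskSet/d102 1 1 /dev/did/rdsk/d2s0

    # Dry-run check against the set, as in the command above:
    # metainit -n -a -s xCid_diskSet

If the entries in the attached md.tab name their components differently (for example as bare d1s0 rather than a full /dev/did/rdsk path, or without the setname/ prefix on the metadevices), that difference may be worth checking against the "device not in set" message.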
-----Original Message-----
From: Sambit.Nayak at Sun.COM [mailto:sambit.na...@sun.com]
Sent: Wednesday, September 16, 2009 1:16 AM
To: Le, Janey
Subject: Re: [ha-clusters-discuss] Host panic - OpenSolaris SunCluster

Hi Janey,

Yes, this looks correct. It means xCid is the owner of the quorum server.
The idea is: one of the cluster member nodes becomes the owner of the
quorum device (holds the reservation), and all cluster member nodes have
their registration keys present on the quorum device. So this looks right.

***********

One more thing: the quorum server/device can intermittently stop
responding, perhaps due to path or network failures. Hence "scstat -q"
and "clquorum status" are enhanced to show the immediate status of the
quorum device/server. So if a node is having problems accessing the
quorum device/server, executing one of the above commands on that node
will report that. If a node is unable to access the quorum device/server,
the command on that node will say the device/server is offline.

The command output on xCid, as shown below, looks great - xCid can access
the quorum server just fine. You can run the same command on the other
node (xCloud) to see whether it is also able to access the quorum server.

Thanks & Regards,
Sambit

Le, Janey wrote:
> Hi Sambit,
>
> I am in the process of setting up the cluster again, and after
> installing the cluster on both of the nodes, I added the quorum server
> to the cluster. Below is what I see on the quorum server after adding
> it:
>
> root@Auron:~# clquorumserver show +
> === Quorum Server on port 9000 ===
>
> Disabled                                        False
>
> --- Cluster CC_Cluster (id 0x4AAEB956) Reservation ---
>
> Node ID:                                        1
> Reservation key:                                0x4aaeb95600000001
>
> --- Cluster CC_Cluster (id 0x4AAEB956) Registrations ---
>
> Node ID:                                        1
> Registration key:                               0x4aaeb95600000001
>
> Node ID:                                        2
> Registration key:                               0x4aaeb95600000002
>
> root@Auron:~#
>
> *** There is only one Reservation key, is that correct? Should we have a
> Reservation key for Node ID: 2 too?
>
> From the 1st cluster node, status of the cluster:
>
> root@xCid:~# scstat -q
>
> -- Quorum Summary from latest node reconfiguration --
>
>   Quorum votes possible:     3
>   Quorum votes needed:       2
>   Quorum votes present:      3
>
>
> -- Quorum Votes by Node (current status) --
>
>                     Node Name           Present  Possible  Status
>                     ---------           -------  --------  ------
>   Node votes:       xCid                1        1         Online
>   Node votes:       xCloud              1        1         Online
>
>
> -- Quorum Votes by Device (current status) --
>
>                     Device Name         Present  Possible  Status
>                     -----------         -------  --------  ------
>   Device votes:     Auron               1        1         Online
>
> root@xCid:~#
>
> root@xCid:~# clnode show
>
> === Cluster Nodes ===
>
> Node Name:                                      xCid
>   Node ID:                                      1
>   Enabled:                                      yes
>   privatehostname:                              clusternode1-priv
>   reboot_on_path_failure:                       disabled
>   globalzoneshares:                             1
>   defaultpsetmin:                               1
>   quorum_vote:                                  1
>   quorum_defaultvote:                           1
>   quorum_resv_key:                              0x4AAEB95600000001
>   Transport Adapter List:                       e1000g2, e1000g3
>
> Node Name:                                      xCloud
>   Node ID:                                      2
>   Enabled:                                      yes
>   privatehostname:                              clusternode2-priv
>   reboot_on_path_failure:                       disabled
>   globalzoneshares:                             1
>   defaultpsetmin:                               1
>   quorum_vote:                                  1
>   quorum_defaultvote:                           1
>   quorum_resv_key:                              0x4AAEB95600000002
>   Transport Adapter List:                       e1000g2, e1000g3
>
> root@xCid:~#
>
> Thanks a lot for your time.
>
> Janey
>
> -----Original Message-----
> From: Sambit.Nayak at Sun.COM [mailto:Sambit.Nayak at Sun.COM]
> Sent: Wednesday, September 09, 2009 1:53 AM
> To: Le, Janey
> Cc: ha-clusters-discuss at opensolaris.org
> Subject: Re: [ha-clusters-discuss] Host panic - OpenSolaris SunCluster
>
> Hi Janey,
>
> The error message:
>
>   WARNING: CMM: Reading reservation keys from quorum device Auron
>   failed with error 2.
>
> means that the node xCid failed to read the quorum keys from the
> quorum server host Auron.
>
> It is possible that the node xCid could not contact the quorum server
> host Auron at that specific time, when the reconfiguration was in
> progress due to the reboot of xCloud.
> Also please look in the syslog messages on xCid for any failure
> messages related to the quorum server.
>
> Does the problem happen every time you reboot xCloud, after both nodes
> are successfully online in the cluster?
>
> Things that can be done to debug
> --------------------------------------------
> More information about this failure will be available in the cluster
> kernel trace buffers.
>
> If you can obtain the kernel dumps of the cluster nodes at the time of
> this problem, then we can look into them to debug the problem further.
> If you are not able to provide the dumps, then please run:
>
>   *cmm_dbg_buf/s
>   *(cmm_dbg_buf+8)+1/s
>
> at the kmdb prompt resulting from the panic (or on the saved crash
> dump), and provide that output.
>
> Please also save the /var/adm/messages of both nodes.
>
> Each quorum server daemon (on a quorum server host) has an associated
> directory where it stores some files.
> By default, /var/scqsd is the directory used by a quorum server daemon.
> If you have changed the default directory while configuring the quorum
> server, then please look in it instead.
> There will be files named ".scqsd_dbg_buf*" in such a directory.
> Please provide those files as well; they will tell us what's happening
> on the quorum server host Auron.
>
> If you execute the "clquorum status" command on a cluster node, it
> will tell whether the local node can access the quorum server at the
> time of the command execution. If access is possible and the node's
> keys are present, the quorum server is marked online; otherwise it is
> marked offline.
> So if you execute this command on both cluster nodes before doing the
> experiment of rebooting xCloud, that will tell whether any node was
> having problems accessing the quorum server.
> Please run that command on both nodes, and capture the output, before
> rebooting xCloud.
>
> Similarly, "clquorumserver show +" on the quorum server host will
> tell what cluster it is serving, what keys are present on the quorum
> server, which cluster node is the owner of the quorum server, etc.
> Please capture its output before rebooting xCloud, and after xCid
> panics as a result of rebooting xCloud.
>
> ************
>
> Just as a confirmation: the cluster is running Open HA Cluster
> 2009.06, and you are using the quorum server packages available with
> Open HA Cluster 2009.06, right?
>
> Thanks & Regards,
> Sambit
>
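Pulled together, the checks suggested above amount to roughly the following sequence (a sketch only, using the host names xCid, xCloud, and Auron and only the commands already named in this thread):

    # On each cluster node (xCid and xCloud), before rebooting xCloud:
    clquorum status
    scstat -q

    # On the quorum server host Auron, before rebooting xCloud and again
    # after any panic of xCid:
    clquorumserver show +

    # If a node panics, at the kmdb prompt (or on the saved crash dump):
    *cmm_dbg_buf/s
    *(cmm_dbg_buf+8)+1/s

    # Also save /var/adm/messages from both nodes, and the .scqsd_dbg_buf*
    # files from /var/scqsd (or the configured directory) on Auron.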
> Janey Le wrote:
>> After setting up SunCluster on OpenSolaris, when I reboot the second
>> node of the cluster, my first node panics. Can you please let me know
>> if there is anyone I can contact to find out whether this is a setup
>> issue or a cluster bug?
>>
>> Below is the setup that I had:
>>
>> - 2x1 (2 OpenSolaris 2009.06 x86 hosts named xCid and xCloud connected
>>   to one FC array)
>> - Created 32 volumes and mapped them to the host group; the two cluster
>>   nodes are under the host group
>> - Formatted the volumes
>> - Set up the cluster with a quorum server named Auron (both nodes joined
>>   the cluster; all of the resource groups and resources are online on
>>   the 1st node, xCid)
>>
>> Below is the status of the cluster before rebooting the nodes.
>>
>> root@xCid:~# scstat -p
>> ------------------------------------------------------------------
>>
>> -- Cluster Nodes --
>>
>>                     Node name           Status
>>                     ---------           ------
>>   Cluster node:     xCid                Online
>>   Cluster node:     xCloud              Online
>>
>> ------------------------------------------------------------------
>>
>> -- Cluster Transport Paths --
>>
>>                     Endpoint            Endpoint            Status
>>                     --------            --------            ------
>>   Transport path:   xCid:e1000g3        xCloud:e1000g3      Path online
>>   Transport path:   xCid:e1000g2        xCloud:e1000g2      Path online
>>
>> ------------------------------------------------------------------
>>
>> -- Quorum Summary from latest node reconfiguration --
>>
>>   Quorum votes possible:     3
>>   Quorum votes needed:       2
>>   Quorum votes present:      3
>>
>>
>> -- Quorum Votes by Node (current status) --
>>
>>                     Node Name           Present  Possible  Status
>>                     ---------           -------  --------  ------
>>   Node votes:       xCid                1        1         Online
>>   Node votes:       xCloud              1        1         Online
>>
>>
>> -- Quorum Votes by Device (current status) --
>>
>>                     Device Name         Present  Possible  Status
>>                     -----------         -------  --------  ------
>>   Device votes:     Auron               1        1         Online
>>
>> ------------------------------------------------------------------
>>
>> -- Device Group Servers --
>>
>>                     Device Group        Primary             Secondary
>>                     ------------        -------             ---------
>>
>>
>> -- Device Group Status --
>>
>>                     Device Group        Status
>>                     ------------        ------
>>
>>
>> -- Multi-owner Device Groups --
>>
>>                     Device Group        Online Status
>>                     ------------        -------------
>>
>> ------------------------------------------------------------------
>>
>> -- Resource Groups and Resources --
>>
>>             Group Name     Resources
>>             ----------     ---------
>>  Resources: xCloud-rg      xCloud-nfsres r-nfs
>>  Resources: nfs-rg         nfs-lh-rs nfs-hastp-rs nfs-rs
>>
>>
>> -- Resource Groups --
>>
>>             Group Name     Node Name      State        Suspended
>>             ----------     ---------      -----        ---------
>>      Group: xCloud-rg      xCid           Online       No
>>      Group: xCloud-rg      xCloud         Offline      No
>>
>>      Group: nfs-rg         xCid           Online       No
>>      Group: nfs-rg         xCloud         Offline      No
>>
>>
>> -- Resources --
>>
>>             Resource Name  Node Name      State        Status Message
>>             -------------  ---------      -----        --------------
>>   Resource: xCloud-nfsres  xCid           Online       Online - LogicalHostname online.
>>   Resource: xCloud-nfsres  xCloud         Offline      Offline
>>
>>   Resource: r-nfs          xCid           Online       Online - Service is online.
>>   Resource: r-nfs          xCloud         Offline      Offline
>>
>>   Resource: nfs-lh-rs      xCid           Online       Online - LogicalHostname online.
>>   Resource: nfs-lh-rs      xCloud         Offline      Offline
>>
>>   Resource: nfs-hastp-rs   xCid           Online       Online
>>   Resource: nfs-hastp-rs   xCloud         Offline      Offline
>>
>>   Resource: nfs-rs         xCid           Online       Online - Service is online.
>>   Resource: nfs-rs         xCloud         Offline      Offline
>>
>> ------------------------------------------------------------------
>>
>> -- IPMP Groups --
>>
>>               Node Name    Group      Status    Adapter    Status
>>               ---------    -----      ------    -------    ------
>>   IPMP Group: xCid         sc_ipmp0   Online    e1000g1    Online
>>
>>   IPMP Group: xCloud       sc_ipmp0   Online    e1000g0    Online
>>
>>
>> -- IPMP Groups in Zones --
>>
>>               Zone Name    Group      Status    Adapter    Status
>>               ---------    -----      ------    -------    ------
>> ------------------------------------------------------------------
>> root@xCid:~#
>>
>>
>> root@xCid:~# clnode show
>>
>> === Cluster Nodes ===
>>
>> Node Name:                                      xCid
>>   Node ID:                                      1
>>   Enabled:                                      yes
>>   privatehostname:                              clusternode1-priv
>>   reboot_on_path_failure:                       disabled
>>   globalzoneshares:                             1
>>   defaultpsetmin:                               1
>>   quorum_vote:                                  1
>>   quorum_defaultvote:                           1
>>   quorum_resv_key:                              0x4A9B35C600000001
>>   Transport Adapter List:                       e1000g2, e1000g3
>>
>> Node Name:                                      xCloud
>>   Node ID:                                      2
>>   Enabled:                                      yes
>>   privatehostname:                              clusternode2-priv
>>   reboot_on_path_failure:                       disabled
>>   globalzoneshares:                             1
>>   defaultpsetmin:                               1
>>   quorum_vote:                                  1
>>   quorum_defaultvote:                           1
>>   quorum_resv_key:                              0x4A9B35C600000002
>>   Transport Adapter List:                       e1000g2, e1000g3
>>
>> root@xCid:~#
>>
>>
>> ****** Reboot 1st node xCid, all of the resources transfer to 2nd
>> node xCloud and online on node xCloud ************
>>
>> root@xCloud:~# scstat -p
>> ------------------------------------------------------------------
>>
>> -- Cluster Nodes --
>>
>>                     Node name           Status
>>                     ---------           ------
>>   Cluster node:     xCid                Online
>>   Cluster node:     xCloud              Online
>>
>> ------------------------------------------------------------------
>>
>> -- Cluster Transport Paths --
>>
>>                     Endpoint            Endpoint            Status
>>                     --------            --------            ------
>>   Transport path:   xCid:e1000g3        xCloud:e1000g3      Path online
>>   Transport path:   xCid:e1000g2        xCloud:e1000g2      Path online
>>
>> ------------------------------------------------------------------
>>
>> -- Quorum Summary from latest node reconfiguration --
>>
>>   Quorum votes possible:     3
>>   Quorum votes needed:       2
>>   Quorum votes present:      3
>>
>>
>> -- Quorum Votes by Node (current status) --
>>
>>                     Node Name           Present  Possible  Status
>>                     ---------           -------  --------  ------
>>   Node votes:       xCid                1        1         Online
>>   Node votes:       xCloud              1        1         Online
>>
>>
>> -- Quorum Votes by Device (current status) --
>>
>>                     Device Name         Present  Possible  Status
>>                     -----------         -------  --------  ------
>>   Device votes:     Auron               1        1         Online
>>
>> ------------------------------------------------------------------
>>
>> -- Device Group Servers --
>>
>>                     Device Group        Primary             Secondary
>>                     ------------        -------             ---------
>>
>>
>> -- Device Group Status --
>>
>>                     Device Group        Status
>>                     ------------        ------
>>
>>
>> -- Multi-owner Device Groups --
>>
>>                     Device Group        Online Status
>>                     ------------        -------------
>>
>> ------------------------------------------------------------------
>>
>> -- Resource Groups and Resources --
>>
>>             Group Name     Resources
>>             ----------     ---------
>>  Resources: xCloud-rg      xCloud-nfsres r-nfs
>>  Resources: nfs-rg         nfs-lh-rs nfs-hastp-rs nfs-rs
>>
>>
>> -- Resource Groups --
>>
>>             Group Name     Node Name      State        Suspended
>>             ----------     ---------      -----        ---------
>>      Group: xCloud-rg      xCid           Offline      No
>>      Group: xCloud-rg      xCloud         Online       No
>>
>>      Group: nfs-rg         xCid           Offline      No
>>      Group: nfs-rg         xCloud         Online       No
>>
>>
>> -- Resources --
>>
>>             Resource Name  Node Name      State        Status Message
>>             -------------  ---------      -----        --------------
>>   Resource: xCloud-nfsres  xCid           Offline      Offline
>>   Resource: xCloud-nfsres  xCloud         Online       Online - LogicalHostname online.
>>
>>   Resource: r-nfs          xCid           Offline      Offline
>>   Resource: r-nfs          xCloud         Online       Online - Service is online.
>>
>>   Resource: nfs-lh-rs      xCid           Offline      Offline
>>   Resource: nfs-lh-rs      xCloud         Online       Online - LogicalHostname online.
>>
>>   Resource: nfs-hastp-rs   xCid           Offline      Offline
>>   Resource: nfs-hastp-rs   xCloud         Online       Online
>>
>>   Resource: nfs-rs         xCid           Offline      Offline
>>   Resource: nfs-rs         xCloud         Online       Online - Service is online.
>>
>> ------------------------------------------------------------------
>>
>> -- IPMP Groups --
>>
>>               Node Name    Group      Status    Adapter    Status
>>               ---------    -----      ------    -------    ------
>>   IPMP Group: xCid         sc_ipmp0   Online    e1000g1    Online
>>
>>   IPMP Group: xCloud       sc_ipmp0   Online    e1000g0    Online
>>
>>
>> -- IPMP Groups in Zones --
>>
>>               Zone Name    Group      Status    Adapter    Status
>>               ---------    -----      ------    -------    ------
>> ------------------------------------------------------------------
>> root@xCloud:~#
>>
>>
>> *********** Wait about 5 minutes, then reboot the 2nd node xCloud, and
>> node xCid panics with the error below *********************
>>
>> root@xCid:~# Notifying cluster that this node is panicking
>> WARNING: CMM: Reading reservation keys from quorum device Auron failed with error 2.
>>
>> panic[cpu0]/thread=ffffff02d0a623c0: CMM: Cluster lost operational quorum; aborting.
>>
>> ffffff0011976b50 genunix:vcmn_err+2c ()
>> ffffff0011976b60 cl_runtime:__1cZsc_syslog_msg_log_no_args6FpviipkcpnR__va_list_element__nZsc_syslog_msg_status_enum__+1f ()
>> ffffff0011976c40 cl_runtime:__1cCosNsc_syslog_msgDlog6MiipkcE_nZsc_syslog_msg_status_enum__+8c ()
>> ffffff0011976e30 cl_haci:__1cOautomaton_implbAstate_machine_qcheck_state6M_nVcmm_automaton_event_t__+57f ()
>> ffffff0011976e70 cl_haci:__1cIcmm_implStransitions_thread6M_v_+b7 ()
>> ffffff0011976e80 cl_haci:__1cIcmm_implYtransitions_thread_start6Fpv_v_+9 ()
>> ffffff0011976ed0 cl_orb:cllwpwrapper+d7 ()
>> ffffff0011976ee0 unix:thread_start+8 ()
>>
>> syncing file systems... done
>> dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
>> 51% done
>>
>> The host log is attached.
>>
>> I have gone through the SunCluster doc on how to set up SunCluster for
>> OpenSolaris multiple times, but I don't see any steps that I missed.
>> Can you please help determine whether this is a setup issue or a bug?
>>
>> Thanks,
>>
>> Janey
>>
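For reference, a rough sketch of the quorum arithmetic behind the panic above, using only the vote counts shown in the scstat output earlier in this thread:

    votes possible = 2 node votes (xCid, xCloud) + 1 quorum device vote (Auron) = 3
    votes needed   = majority of 3                                              = 2
    votes held by xCid while xCloud reboots and the quorum server is
    unreadable ("failed with error 2")                                          = 1

    1 < 2, so CMM reports "Cluster lost operational quorum" and aborts the node.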
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> ha-clusters-discuss mailing list
>> ha-clusters-discuss at opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/ha-clusters-discuss

-------------- next part --------------
A non-text attachment was scrubbed...
Name: md.tab
Type: application/octet-stream
Size: 3653 bytes
Desc: md.tab
URL: <http://mail.opensolaris.org/pipermail/ha-clusters-discuss/attachments/20090918/ea0dbd15/attachment-0001.obj>