Hi Janey,

If I understood correctly, the quorum server is working correctly now, right?
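(If you want to double-check, assuming the cluster commands are in their
default install location, something like this should confirm it:

    # on either cluster node - the quorum device Auron should show as Online
    /usr/cluster/bin/clquorum status

    # on the quorum server host Auron - only your current cluster should be listed
    /usr/cluster/bin/clquorumserver show +
)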
Some more replies inline...

Le, Janey wrote:
> Hi Sambit,
> I looked into the quorum server using the command "clquorumserver show +";
> it showed me the old services that the quorum server did serve before this
> setup, so I reinstalled my host and set up the quorum server again.

There are cases where the quorum server could still be maintaining
information about the clusters it serviced in the past (it can happen as a
result of an unclean removal of the quorum server, for instance). This
should not affect its operation with the current clusters that it is
serving. There is a procedure to clean up such stale information anyhow -
you do not need to reinstall the quorum server for that. Please look in the
section "How to Clean Up the Quorum Server Configuration Information" in
the quorum server document
(http://docs.sun.com/app/docs/doc/820-4679/gfjrh?l=en&a=view).

> From the document that I got from
> http://opensolaris.org/os/community/ha-clusters/ohac/Documentation/OHACdocs/
> and from the restrictions, Veritas Volume Manager is not supported in
> OpenSolaris HA Cluster. So, I wonder if we need to use metaset to set up a
> diskset to manage the disks, or whether what I have from the doc is enough?

I'll let other folks answer this one.

Thanks & Regards,
Sambit

> Thanks,
>
> Janey
>
> -----Original Message-----
> From: Sambit.Nayak at Sun.COM [mailto:Sambit.Nayak at Sun.COM]
> Sent: Wednesday, September 09, 2009 1:53 AM
> To: Le, Janey
> Cc: ha-clusters-discuss at opensolaris.org
> Subject: Re: [ha-clusters-discuss] Host panic - OpenSolaris SunCluster
>
> Hi Janey,
>
> The error message:
>
>     WARNING: CMM: Reading reservation keys from quorum device Auron
>     failed with error 2.
>
> means that the node xCid failed to read the quorum keys from the quorum
> server host Auron.
>
> It is possible that the node xCid could not contact the quorum server
> host Auron at that specific time, when the reconfiguration was in
> progress due to the reboot of xCloud.
> Also, please look in the syslog messages on xCid for any failure
> messages related to the quorum server.
>
> Does the problem happen every time you reboot xCloud after both nodes
> are successfully online in the cluster?
>
> Things that can be done to debug
> --------------------------------
> More information about this failure will be available in the cluster
> kernel trace buffers.
>
> If you can obtain the kernel dumps of the cluster nodes at the time of
> this problem, then we can look into them to debug the problem further.
> If you are not able to provide the dumps, then please run:
>
>     *cmm_dbg_buf/s
>     *(cmm_dbg_buf+8)+1/s
>
> at the kmdb prompt resulting from the panic (or on the saved crash
> dump), and provide that output.
>
> Please also save the /var/adm/messages of both nodes.
>
> Each quorum server daemon (on a quorum server host) has an associated
> directory where it stores some files.
> By default, /var/scqsd is the directory used by a quorum server daemon.
> If you have changed the default directory while configuring the quorum
> server, then please look there instead.
> There will be files named ".scqsd_dbg_buf*" in that directory.
> Please provide those files as well; they will tell us what's happening
> on the quorum server host Auron.
>
> If you execute the "clquorum status" command on a cluster node, it will
> tell whether the local node can access the quorum server at the time the
> command is executed. If access is possible and the node's keys are
> present, the quorum server is marked online; otherwise it is marked
> offline.
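(To make gathering the items above easier, here is a rough collection
sketch - the crash-dump location and the file names below are only
assumptions, so adjust as needed; dumpadm(1M) shows where dumps are
actually saved on your nodes:

    # on xCid, once the panic dump has been saved
    cd /var/crash/xCid              # assumed default savecore directory
    mdb -k unix.0 vmcore.0          # then, at the mdb prompt:
      > *cmm_dbg_buf/s
      > *(cmm_dbg_buf+8)+1/s

    # on both cluster nodes, keep a copy of the system log
    cp /var/adm/messages /var/tmp/messages.`hostname`

    # on the quorum server host Auron (adjust the directory if you changed
    # it when configuring the quorum server)
    tar cvf /var/tmp/scqsd_dbg.tar /var/scqsd/.scqsd_dbg_buf*
)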
> So if you execute this command on both cluster nodes before doing the
> experiment of rebooting xCloud, it will tell whether either node was
> having problems accessing the quorum server.
> Please run that command on both nodes, and capture the output, before
> rebooting xCloud.
>
> Similarly, "clquorumserver show +" on the quorum server host will tell
> which cluster it is serving, what keys are present on the quorum server,
> which cluster node is the owner of the quorum server, and so on.
> Please capture its output before rebooting xCloud, and again after xCid
> panics as a result of rebooting xCloud.
>
> ************
>
> Just as a confirmation: the cluster is running Open HA Cluster 2009.06,
> and you are using the quorum server packages available with Open HA
> Cluster 2009.06, right?
>
> Thanks & Regards,
> Sambit
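(For the before/after capture requested above, something along these lines
would do - the output file names are just suggestions:

    # on xCid and on xCloud, shortly before rebooting xCloud
    /usr/cluster/bin/clquorum status > /var/tmp/clquorum.`hostname`.before 2>&1

    # on the quorum server host Auron, before rebooting xCloud ...
    /usr/cluster/bin/clquorumserver show + > /var/tmp/clqs.show.before 2>&1

    # ... and again on Auron after xCid panics
    /usr/cluster/bin/clquorumserver show + > /var/tmp/clqs.show.after 2>&1
)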
>
> Janey Le wrote:
>
>> After setting up SunCluster on OpenSolaris, when I reboot the second
>> node of the cluster, my first node panics. Can you please let me know if
>> there is anyone that I can contact to find out whether this is a setup
>> issue or a cluster bug?
>>
>> Below is the setup that I had:
>>
>> - 2x1 (2 OpenSolaris 2009.06 x86 hosts named xCid and xCloud connected
>>   to one FC array)
>> - Created 32 volumes and mapped them to the host group; under the host
>>   group are the 2 cluster nodes
>> - Formatted the volumes
>> - Set up the cluster with the quorum server named Auron (both nodes
>>   joined the cluster; all of the resource groups and resources are
>>   online on the 1st node xCid)
>>
>> Below is the status of the cluster before rebooting the nodes.
>>
>> root at xCid:~# scstat -p
>> ------------------------------------------------------------------
>>
>> -- Cluster Nodes --
>>
>>                     Node name      Status
>>                     ---------      ------
>>   Cluster node:     xCid           Online
>>   Cluster node:     xCloud         Online
>>
>> ------------------------------------------------------------------
>>
>> -- Cluster Transport Paths --
>>
>>                      Endpoint          Endpoint          Status
>>                      --------          --------          ------
>>   Transport path:    xCid:e1000g3      xCloud:e1000g3    Path online
>>   Transport path:    xCid:e1000g2      xCloud:e1000g2    Path online
>>
>> ------------------------------------------------------------------
>>
>> -- Quorum Summary from latest node reconfiguration --
>>
>>   Quorum votes possible:   3
>>   Quorum votes needed:     2
>>   Quorum votes present:    3
>>
>>
>> -- Quorum Votes by Node (current status) --
>>
>>                  Node Name    Present  Possible  Status
>>                  ---------    -------  --------  ------
>>   Node votes:    xCid         1        1         Online
>>   Node votes:    xCloud       1        1         Online
>>
>>
>> -- Quorum Votes by Device (current status) --
>>
>>                   Device Name  Present  Possible  Status
>>                   -----------  -------  --------  ------
>>   Device votes:   Auron        1        1         Online
>>
>> ------------------------------------------------------------------
>>
>> -- Device Group Servers --
>>
>>   Device Group    Primary    Secondary
>>   ------------    -------    ---------
>>
>>
>> -- Device Group Status --
>>
>>   Device Group    Status
>>   ------------    ------
>>
>>
>> -- Multi-owner Device Groups --
>>
>>   Device Group    Online Status
>>   ------------    -------------
>>
>> ------------------------------------------------------------------
>>
>> -- Resource Groups and Resources --
>>
>>   Group Name    Resources
>>   ----------    ---------
>>   Resources:    xCloud-rg    xCloud-nfsres r-nfs
>>   Resources:    nfs-rg       nfs-lh-rs nfs-hastp-rs nfs-rs
>>
>>
>> -- Resource Groups --
>>
>>   Group Name          Node Name    State      Suspended
>>   ----------          ---------    -----      ---------
>>   Group: xCloud-rg    xCid         Online     No
>>   Group: xCloud-rg    xCloud       Offline    No
>>
>>   Group: nfs-rg       xCid         Online     No
>>   Group: nfs-rg       xCloud       Offline    No
>>
>>
>> -- Resources --
>>
>>   Resource Name             Node Name    State      Status Message
>>   -------------             ---------    -----      --------------
>>   Resource: xCloud-nfsres   xCid         Online     Online - LogicalHostname online.
>>   Resource: xCloud-nfsres   xCloud       Offline    Offline
>>
>>   Resource: r-nfs           xCid         Online     Online - Service is online.
>>   Resource: r-nfs           xCloud       Offline    Offline
>>
>>   Resource: nfs-lh-rs       xCid         Online     Online - LogicalHostname online.
>>   Resource: nfs-lh-rs       xCloud       Offline    Offline
>>
>>   Resource: nfs-hastp-rs    xCid         Online     Online
>>   Resource: nfs-hastp-rs    xCloud       Offline    Offline
>>
>>   Resource: nfs-rs          xCid         Online     Online - Service is online.
>>   Resource: nfs-rs          xCloud       Offline    Offline
>>
>> ------------------------------------------------------------------
>>
>> -- IPMP Groups --
>>
>>                Node Name    Group       Status    Adapter    Status
>>                ---------    -----       ------    -------    ------
>>   IPMP Group:  xCid         sc_ipmp0    Online    e1000g1    Online
>>
>>   IPMP Group:  xCloud       sc_ipmp0    Online    e1000g0    Online
>>
>>
>> -- IPMP Groups in Zones --
>>
>>                Zone Name    Group    Status    Adapter    Status
>>                ---------    -----    ------    -------    ------
>> ------------------------------------------------------------------
>> root at xCid:~#
>>
>>
>> root at xCid:~# clnode show
>>
>> === Cluster Nodes ===
>>
>> Node Name:                    xCid
>>   Node ID:                    1
>>   Enabled:                    yes
>>   privatehostname:            clusternode1-priv
>>   reboot_on_path_failure:     disabled
>>   globalzoneshares:           1
>>   defaultpsetmin:             1
>>   quorum_vote:                1
>>   quorum_defaultvote:         1
>>   quorum_resv_key:            0x4A9B35C600000001
>>   Transport Adapter List:     e1000g2, e1000g3
>>
>> Node Name:                    xCloud
>>   Node ID:                    2
>>   Enabled:                    yes
>>   privatehostname:            clusternode2-priv
>>   reboot_on_path_failure:     disabled
>>   globalzoneshares:           1
>>   defaultpsetmin:             1
>>   quorum_vote:                1
>>   quorum_defaultvote:         1
>>   quorum_resv_key:            0x4A9B35C600000002
>>   Transport Adapter List:     e1000g2, e1000g3
>>
>> root at xCid:~#
>>
>>
>> ****** Reboot the 1st node xCid; all of the resources transfer to the
>> 2nd node xCloud and come online on xCloud ************
>>
>> root at xCloud:~# scstat -p
>> ------------------------------------------------------------------
>>
>> -- Cluster Nodes --
>>
>>                     Node name      Status
>>                     ---------      ------
>>   Cluster node:     xCid           Online
>>   Cluster node:     xCloud         Online
>>
>> ------------------------------------------------------------------
>>
>> -- Cluster Transport Paths --
>>
>>                      Endpoint          Endpoint          Status
>>                      --------          --------          ------
>>   Transport path:    xCid:e1000g3      xCloud:e1000g3    Path online
>>   Transport path:    xCid:e1000g2      xCloud:e1000g2    Path online
>>
>> ------------------------------------------------------------------
>>
>> -- Quorum Summary from latest node reconfiguration --
>>
>>   Quorum votes possible:   3
>>   Quorum votes needed:     2
>>   Quorum votes present:    3
>>
>>
>> -- Quorum Votes by Node (current status) --
>>
>>                  Node Name    Present  Possible  Status
>>                  ---------    -------  --------  ------
>>   Node votes:    xCid         1        1         Online
>>   Node votes:    xCloud       1        1         Online
>>
>>
>> -- Quorum Votes by Device (current status) --
>>
>>                   Device Name  Present  Possible  Status
>>                   -----------  -------  --------  ------
>>   Device votes:   Auron        1        1         Online
>>
>> ------------------------------------------------------------------
>>
>> -- Device Group Servers --
>>
>>   Device Group    Primary    Secondary
>>   ------------    -------    ---------
>>
>>
>> -- Device Group Status --
>>
>>   Device Group    Status
>>   ------------    ------
>>
>>
>> -- Multi-owner Device Groups --
>>
>>   Device Group    Online Status
>>   ------------    -------------
>>
>> ------------------------------------------------------------------
>>
>> -- Resource Groups and Resources --
>>
>>   Group Name    Resources
>>   ----------    ---------
>>   Resources:    xCloud-rg    xCloud-nfsres r-nfs
>>   Resources:    nfs-rg       nfs-lh-rs nfs-hastp-rs nfs-rs
>>
>>
>> -- Resource Groups --
>>
>>   Group Name          Node Name    State      Suspended
>>   ----------          ---------    -----      ---------
>>   Group: xCloud-rg    xCid         Offline    No
>>   Group: xCloud-rg    xCloud       Online     No
>>
>>   Group: nfs-rg       xCid         Offline    No
>>   Group: nfs-rg       xCloud       Online     No
>>
>>
>> -- Resources --
>>
>>   Resource Name             Node Name    State      Status Message
>>   -------------             ---------    -----      --------------
>>   Resource: xCloud-nfsres   xCid         Offline    Offline
>>   Resource: xCloud-nfsres   xCloud       Online     Online - LogicalHostname online.
>>
>>   Resource: r-nfs           xCid         Offline    Offline
>>   Resource: r-nfs           xCloud       Online     Online - Service is online.
>>
>>   Resource: nfs-lh-rs       xCid         Offline    Offline
>>   Resource: nfs-lh-rs       xCloud       Online     Online - LogicalHostname online.
>>
>>   Resource: nfs-hastp-rs    xCid         Offline    Offline
>>   Resource: nfs-hastp-rs    xCloud       Online     Online
>>
>>   Resource: nfs-rs          xCid         Offline    Offline
>>   Resource: nfs-rs          xCloud       Online     Online - Service is online.
>>
>> ------------------------------------------------------------------
>>
>> -- IPMP Groups --
>>
>>                Node Name    Group       Status    Adapter    Status
>>                ---------    -----       ------    -------    ------
>>   IPMP Group:  xCid         sc_ipmp0    Online    e1000g1    Online
>>
>>   IPMP Group:  xCloud       sc_ipmp0    Online    e1000g0    Online
>>
>>
>> -- IPMP Groups in Zones --
>>
>>                Zone Name    Group    Status    Adapter    Status
>>                ---------    -----    ------    -------    ------
>> ------------------------------------------------------------------
>> root at xCloud:~#
>>
>>
>> *********** Wait for about 5 minutes, then reboot the 2nd node xCloud;
>> node xCid panics with the error below *********************
>>
>> root at xCid:~# Notifying cluster that this node is panicking
>> WARNING: CMM: Reading reservation keys from quorum device Auron failed
>> with error 2.
>>
>> panic[cpu0]/thread=ffffff02d0a623c0: CMM: Cluster lost operational
>> quorum; aborting.
>>
>> ffffff0011976b50 genunix:vcmn_err+2c ()
>> ffffff0011976b60 cl_runtime:__1cZsc_syslog_msg_log_no_args6FpviipkcpnR__va_list_element__nZsc_syslog_msg_status_enum__+1f ()
>> ffffff0011976c40 cl_runtime:__1cCosNsc_syslog_msgDlog6MiipkcE_nZsc_syslog_msg_status_enum__+8c ()
>> ffffff0011976e30 cl_haci:__1cOautomaton_implbAstate_machine_qcheck_state6M_nVcmm_automaton_event_t__+57f ()
>> ffffff0011976e70 cl_haci:__1cIcmm_implStransitions_thread6M_v_+b7 ()
>> ffffff0011976e80 cl_haci:__1cIcmm_implYtransitions_thread_start6Fpv_v_+9 ()
>> ffffff0011976ed0 cl_orb:cllwpwrapper+d7 ()
>> ffffff0011976ee0 unix:thread_start+8 ()
>>
>> syncing file systems... done
>> dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
>> 51% done[2mMIdoOe
>>
>> The host log is attached.
>>
>> I have gone through the SunCluster doc on how to set up SunCluster for
>> OpenSolaris multiple times, but I don't see any steps that I missed. Can
>> you please help to see whether this is a setup issue or a bug?
>>
>> Thanks,
>>
>> Janey
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> ha-clusters-discuss mailing list
>> ha-clusters-discuss at opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/ha-clusters-discuss
>>