Hi Janey,

If I understood correctly, the quorum server is working correctly now, right?
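(If you want to double-check, assuming the cluster commands are in their
default install location, something like this should confirm it:

    # on either cluster node - the quorum device Auron should show as Online
    /usr/cluster/bin/clquorum status

    # on the quorum server host Auron - only your current cluster should be listed
    /usr/cluster/bin/clquorumserver show +
)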
Some more replies inline...

Le, Janey wrote:
> Hi Sambit,
> I looked into the quorum server using the command "clquorumserver show +";
> it showed me the old services that the quorum server did serve before this
> setup, so I reinstalled my host and set up the quorum server again.

There are cases where the quorum server could still be maintaining
information about the clusters it serviced in the past (it can happen as a
result of an unclean removal of the quorum server, for instance). This
should not affect its operation with the current clusters that it is
serving. There is a procedure to clean up such stale information anyhow -
you do not need to reinstall the quorum server for that. Please look in the
section "How to Clean Up the Quorum Server Configuration Information" in
the quorum server document
(http://docs.sun.com/app/docs/doc/820-4679/gfjrh?l=en&a=view).

> From the document that I got from
> http://opensolaris.org/os/community/ha-clusters/ohac/Documentation/OHACdocs/
> and from the restrictions, Veritas Volume Manager is not supported in
> OpenSolaris HA Cluster. So, I wonder if we need to use metaset to set up a
> diskset to manage the disks, or whether what I have from the doc is enough?

I'll let other folks answer this one.

Thanks & Regards,
Sambit

> Thanks,
>
> Janey
>
> -----Original Message-----
> From: Sambit.Nayak at Sun.COM [mailto:Sambit.Nayak at Sun.COM]
> Sent: Wednesday, September 09, 2009 1:53 AM
> To: Le, Janey
> Cc: ha-clusters-discuss at opensolaris.org
> Subject: Re: [ha-clusters-discuss] Host panic - OpenSolaris SunCluster
>
> Hi Janey,
>
> The error message:
>
>     WARNING: CMM: Reading reservation keys from quorum device Auron
>     failed with error 2.
>
> means that the node xCid failed to read the quorum keys from the quorum
> server host Auron.
>
> It is possible that the node xCid could not contact the quorum server
> host Auron at that specific time, when the reconfiguration was in
> progress due to the reboot of xCloud.
> Also, please look in the syslog messages on xCid for any failure
> messages related to the quorum server.
>
> Does the problem happen every time you reboot xCloud after both nodes
> are successfully online in the cluster?
>
> Things that can be done to debug
> --------------------------------
> More information about this failure will be available in the cluster
> kernel trace buffers.
>
> If you can obtain the kernel dumps of the cluster nodes at the time of
> this problem, then we can look into them to debug the problem further.
> If you are not able to provide the dumps, then please run:
>
>     *cmm_dbg_buf/s
>     *(cmm_dbg_buf+8)+1/s
>
> at the kmdb prompt resulting from the panic (or on the saved crash
> dump), and provide that output.
>
> Please also save the /var/adm/messages of both nodes.
>
> Each quorum server daemon (on a quorum server host) has an associated
> directory where it stores some files.
> By default, /var/scqsd is the directory used by a quorum server daemon.
> If you have changed the default directory while configuring the quorum
> server, then please look there instead.
> There will be files named ".scqsd_dbg_buf*" in that directory.
> Please provide those files as well; they will tell us what's happening
> on the quorum server host Auron.
>
> If you execute the "clquorum status" command on a cluster node, it will
> tell whether the local node can access the quorum server at the time the
> command is executed. If access is possible and the node's keys are
> present, the quorum server is marked online; otherwise it is marked
> offline.
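(To make gathering the items above easier, here is a rough collection
sketch - the crash-dump location and the file names below are only
assumptions, so adjust as needed; dumpadm(1M) shows where dumps are
actually saved on your nodes:

    # on xCid, once the panic dump has been saved
    cd /var/crash/xCid              # assumed default savecore directory
    mdb -k unix.0 vmcore.0          # then, at the mdb prompt:
      > *cmm_dbg_buf/s
      > *(cmm_dbg_buf+8)+1/s

    # on both cluster nodes, keep a copy of the system log
    cp /var/adm/messages /var/tmp/messages.`hostname`

    # on the quorum server host Auron (adjust the directory if you changed
    # it when configuring the quorum server)
    tar cvf /var/tmp/scqsd_dbg.tar /var/scqsd/.scqsd_dbg_buf*
)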
> So if you execute this command on both cluster nodes before doing the
> experiment of rebooting xCloud, it will tell whether either node was
> having problems accessing the quorum server.
> Please run that command on both nodes, and capture the output, before
> rebooting xCloud.
>
> Similarly, "clquorumserver show +" on the quorum server host will tell
> which cluster it is serving, what keys are present on the quorum server,
> which cluster node is the owner of the quorum server, and so on.
> Please capture its output before rebooting xCloud, and again after xCid
> panics as a result of rebooting xCloud.
>
> ************
>
> Just as a confirmation: the cluster is running Open HA Cluster 2009.06,
> and you are using the quorum server packages available with Open HA
> Cluster 2009.06, right?
>
> Thanks & Regards,
> Sambit
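(For the before/after capture requested above, something along these lines
would do - the output file names are just suggestions:

    # on xCid and on xCloud, shortly before rebooting xCloud
    /usr/cluster/bin/clquorum status > /var/tmp/clquorum.`hostname`.before 2>&1

    # on the quorum server host Auron, before rebooting xCloud ...
    /usr/cluster/bin/clquorumserver show + > /var/tmp/clqs.show.before 2>&1

    # ... and again on Auron after xCid panics
    /usr/cluster/bin/clquorumserver show + > /var/tmp/clqs.show.after 2>&1
)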
>
> Janey Le wrote:
>
>> After setting up SunCluster on OpenSolaris, when I reboot the second
>> node of the cluster, my first node panics. Can you please let me know if
>> there is anyone that I can contact to find out whether this is a setup
>> issue or a cluster bug?
>>
>> Below is the setup that I had:
>>
>> - 2x1 (2 OpenSolaris 2009.06 x86 hosts named xCid and xCloud connected
>>   to one FC array)
>> - Created 32 volumes and mapped them to the host group; under the host
>>   group are the 2 cluster nodes
>> - Formatted the volumes
>> - Set up the cluster with the quorum server named Auron (both nodes
>>   joined the cluster; all of the resource groups and resources are
>>   online on the 1st node xCid)
>>
>> Below is the status of the cluster before rebooting the nodes.
>>
>> root at xCid:~# scstat -p
>> ------------------------------------------------------------------
>>
>> -- Cluster Nodes --
>>
>>                     Node name      Status
>>                     ---------      ------
>>   Cluster node:     xCid           Online
>>   Cluster node:     xCloud         Online
>>
>> ------------------------------------------------------------------
>>
>> -- Cluster Transport Paths --
>>
>>                      Endpoint          Endpoint          Status
>>                      --------          --------          ------
>>   Transport path:    xCid:e1000g3      xCloud:e1000g3    Path online
>>   Transport path:    xCid:e1000g2      xCloud:e1000g2    Path online
>>
>> ------------------------------------------------------------------
>>
>> -- Quorum Summary from latest node reconfiguration --
>>
>>   Quorum votes possible:   3
>>   Quorum votes needed:     2
>>   Quorum votes present:    3
>>
>>
>> -- Quorum Votes by Node (current status) --
>>
>>                  Node Name    Present  Possible  Status
>>                  ---------    -------  --------  ------
>>   Node votes:    xCid         1        1         Online
>>   Node votes:    xCloud       1        1         Online
>>
>>
>> -- Quorum Votes by Device (current status) --
>>
>>                   Device Name  Present  Possible  Status
>>                   -----------  -------  --------  ------
>>   Device votes:   Auron        1        1         Online
>>
>> ------------------------------------------------------------------
>>
>> -- Device Group Servers --
>>
>>   Device Group    Primary    Secondary
>>   ------------    -------    ---------
>>
>>
>> -- Device Group Status --
>>
>>   Device Group    Status
>>   ------------    ------
>>
>>
>> -- Multi-owner Device Groups --
>>
>>   Device Group    Online Status
>>   ------------    -------------
>>
>> ------------------------------------------------------------------
>>
>> -- Resource Groups and Resources --
>>
>>   Group Name    Resources
>>   ----------    ---------
>>   Resources:    xCloud-rg    xCloud-nfsres r-nfs
>>   Resources:    nfs-rg       nfs-lh-rs nfs-hastp-rs nfs-rs
>>
>>
>> -- Resource Groups --
>>
>>   Group Name          Node Name    State      Suspended
>>   ----------          ---------    -----      ---------
>>   Group: xCloud-rg    xCid         Online     No
>>   Group: xCloud-rg    xCloud       Offline    No
>>
>>   Group: nfs-rg       xCid         Online     No
>>   Group: nfs-rg       xCloud       Offline    No
>>
>>
>> -- Resources --
>>
>>   Resource Name             Node Name    State      Status Message
>>   -------------             ---------    -----      --------------
>>   Resource: xCloud-nfsres   xCid         Online     Online - LogicalHostname online.
>>   Resource: xCloud-nfsres   xCloud       Offline    Offline
>>
>>   Resource: r-nfs           xCid         Online     Online - Service is online.
>>   Resource: r-nfs           xCloud       Offline    Offline
>>
>>   Resource: nfs-lh-rs       xCid         Online     Online - LogicalHostname online.
>>   Resource: nfs-lh-rs       xCloud       Offline    Offline
>>
>>   Resource: nfs-hastp-rs    xCid         Online     Online
>>   Resource: nfs-hastp-rs    xCloud       Offline    Offline
>>
>>   Resource: nfs-rs          xCid         Online     Online - Service is online.
>>   Resource: nfs-rs          xCloud       Offline    Offline
>>
>> ------------------------------------------------------------------
>>
>> -- IPMP Groups --
>>
>>                Node Name    Group       Status    Adapter    Status
>>                ---------    -----       ------    -------    ------
>>   IPMP Group:  xCid         sc_ipmp0    Online    e1000g1    Online
>>
>>   IPMP Group:  xCloud       sc_ipmp0    Online    e1000g0    Online
>>
>>
>> -- IPMP Groups in Zones --
>>
>>                Zone Name    Group    Status    Adapter    Status
>>                ---------    -----    ------    -------    ------
>> ------------------------------------------------------------------
>> root at xCid:~#
>>
>>
>> root at xCid:~# clnode show
>>
>> === Cluster Nodes ===
>>
>> Node Name:                    xCid
>>   Node ID:                    1
>>   Enabled:                    yes
>>   privatehostname:            clusternode1-priv
>>   reboot_on_path_failure:     disabled
>>   globalzoneshares:           1
>>   defaultpsetmin:             1
>>   quorum_vote:                1
>>   quorum_defaultvote:         1
>>   quorum_resv_key:            0x4A9B35C600000001
>>   Transport Adapter List:     e1000g2, e1000g3
>>
>> Node Name:                    xCloud
>>   Node ID:                    2
>>   Enabled:                    yes
>>   privatehostname:            clusternode2-priv
>>   reboot_on_path_failure:     disabled
>>   globalzoneshares:           1
>>   defaultpsetmin:             1
>>   quorum_vote:                1
>>   quorum_defaultvote:         1
>>   quorum_resv_key:            0x4A9B35C600000002
>>   Transport Adapter List:     e1000g2, e1000g3
>>
>> root at xCid:~#
>>
>>
>> ****** Reboot the 1st node xCid; all of the resources transfer to the
>> 2nd node xCloud and come online on xCloud ************
>>
>> root at xCloud:~# scstat -p
>> ------------------------------------------------------------------
>>
>> -- Cluster Nodes --
>>
>>                     Node name      Status
>>                     ---------      ------
>>   Cluster node:     xCid           Online
>>   Cluster node:     xCloud         Online
>>
>> ------------------------------------------------------------------
>>
>> -- Cluster Transport Paths --
>>
>>                      Endpoint          Endpoint          Status
>>                      --------          --------          ------
>>   Transport path:    xCid:e1000g3      xCloud:e1000g3    Path online
>>   Transport path:    xCid:e1000g2      xCloud:e1000g2    Path online
>>
>> ------------------------------------------------------------------
>>
>> -- Quorum Summary from latest node reconfiguration --
>>
>>   Quorum votes possible:   3
>>   Quorum votes needed:     2
>>   Quorum votes present:    3
>>
>>
>> -- Quorum Votes by Node (current status) --
>>
>>                  Node Name    Present  Possible  Status
>>                  ---------    -------  --------  ------
>>   Node votes:    xCid         1        1         Online
>>   Node votes:    xCloud       1        1         Online
>>
>>
>> -- Quorum Votes by Device (current status) --
>>
>>                   Device Name  Present  Possible  Status
>>                   -----------  -------  --------  ------
>>   Device votes:   Auron        1        1         Online
>>
>> ------------------------------------------------------------------
>>
>> -- Device Group Servers --
>>
>>   Device Group    Primary    Secondary
>>   ------------    -------    ---------
>>
>>
>> -- Device Group Status --
>>
>>   Device Group    Status
>>   ------------    ------
>>
>>
>> -- Multi-owner Device Groups --
>>
>>   Device Group    Online Status
>>   ------------    -------------
>>
>> ------------------------------------------------------------------
>>
>> -- Resource Groups and Resources --
>>
>>   Group Name    Resources
>>   ----------    ---------
>>   Resources:    xCloud-rg    xCloud-nfsres r-nfs
>>   Resources:    nfs-rg       nfs-lh-rs nfs-hastp-rs nfs-rs
>>
>>
>> -- Resource Groups --
>>
>>   Group Name          Node Name    State      Suspended
>>   ----------          ---------    -----      ---------
>>   Group: xCloud-rg    xCid         Offline    No
>>   Group: xCloud-rg    xCloud       Online     No
>>
>>   Group: nfs-rg       xCid         Offline    No
>>   Group: nfs-rg       xCloud       Online     No
>>
>>
>> -- Resources --
>>
>>   Resource Name             Node Name    State      Status Message
>>   -------------             ---------    -----      --------------
>>   Resource: xCloud-nfsres   xCid         Offline    Offline
>>   Resource: xCloud-nfsres   xCloud       Online     Online - LogicalHostname online.
>>
>>   Resource: r-nfs           xCid         Offline    Offline
>>   Resource: r-nfs           xCloud       Online     Online - Service is online.
>>
>>   Resource: nfs-lh-rs       xCid         Offline    Offline
>>   Resource: nfs-lh-rs       xCloud       Online     Online - LogicalHostname online.
>>
>>   Resource: nfs-hastp-rs    xCid         Offline    Offline
>>   Resource: nfs-hastp-rs    xCloud       Online     Online
>>
>>   Resource: nfs-rs          xCid         Offline    Offline
>>   Resource: nfs-rs          xCloud       Online     Online - Service is online.
>>
>> ------------------------------------------------------------------
>>
>> -- IPMP Groups --
>>
>>                Node Name    Group       Status    Adapter    Status
>>                ---------    -----       ------    -------    ------
>>   IPMP Group:  xCid         sc_ipmp0    Online    e1000g1    Online
>>
>>   IPMP Group:  xCloud       sc_ipmp0    Online    e1000g0    Online
>>
>>
>> -- IPMP Groups in Zones --
>>
>>                Zone Name    Group    Status    Adapter    Status
>>                ---------    -----    ------    -------    ------
>> ------------------------------------------------------------------
>> root at xCloud:~#
>>
>>
>> *********** Wait for about 5 minutes, then reboot the 2nd node xCloud;
>> node xCid panics with the error below *********************
>>
>> root at xCid:~# Notifying cluster that this node is panicking
>> WARNING: CMM: Reading reservation keys from quorum device Auron failed
>> with error 2.
>>
>> panic[cpu0]/thread=ffffff02d0a623c0: CMM: Cluster lost operational
>> quorum; aborting.
>>
>> ffffff0011976b50 genunix:vcmn_err+2c ()
>> ffffff0011976b60 cl_runtime:__1cZsc_syslog_msg_log_no_args6FpviipkcpnR__va_list_element__nZsc_syslog_msg_status_enum__+1f ()
>> ffffff0011976c40 cl_runtime:__1cCosNsc_syslog_msgDlog6MiipkcE_nZsc_syslog_msg_status_enum__+8c ()
>> ffffff0011976e30 cl_haci:__1cOautomaton_implbAstate_machine_qcheck_state6M_nVcmm_automaton_event_t__+57f ()
>> ffffff0011976e70 cl_haci:__1cIcmm_implStransitions_thread6M_v_+b7 ()
>> ffffff0011976e80 cl_haci:__1cIcmm_implYtransitions_thread_start6Fpv_v_+9 ()
>> ffffff0011976ed0 cl_orb:cllwpwrapper+d7 ()
>> ffffff0011976ee0 unix:thread_start+8 ()
>>
>> syncing file systems... done
>> dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
>> 51% done[2mMIdoOe
>>
>> The host log is attached.
>>
>> I have gone through the SunCluster doc on how to set up SunCluster for
>> OpenSolaris multiple times, but I don't see any steps that I missed. Can
>> you please help to see whether this is a setup issue or a bug?
>>
>> Thanks,
>>
>> Janey
>>
>> ------------------------------------------------------------------------
>>
>> _______________________________________________
>> ha-clusters-discuss mailing list
>> ha-clusters-discuss at opensolaris.org
>> http://mail.opensolaris.org/mailman/listinfo/ha-clusters-discuss
>>