Instead of using metasets (SVM), use ZFS/zpools; they are much easier to set up.

Thanks,
Tirthankar
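For example, the shared LUNs can go into a zpool that an HAStoragePlus resource imports and exports as its resource group fails over between the nodes. This is only a rough sketch; the pool, device, resource-group, and resource names below are placeholders, not anything from this setup:

  # On one node, create a pool on the shared LUNs (device names are examples)
  zpool create mypool mirror c2t0d0 c2t1d0

  # Put the pool under cluster control as a failover resource
  clresourcetype register SUNW.HAStoragePlus
  clresourcegroup create my-rg
  clresource create -g my-rg -t SUNW.HAStoragePlus -p Zpools=mypool my-hasp-rs
  clresourcegroup manage my-rg
  clresourcegroup online my-rg

Other resources in the same group (the NFS resources, for instance) can then depend on the HAStoragePlus resource and use file systems from the pool.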
http://blogs.sun.com/tirthankar

On 09/14/09 14:41, Binu Jose Philip wrote:
> On Mon, Sep 14, 2009 at 2:10 PM, Sambit Nayak <Sambit.Nayak at sun.com> wrote:
>
>> Hi Janey,
>>
>> If I understood correctly, the quorum server is working correctly now, right?
>>
>> Some more replies inline...
>>
>> Le, Janey wrote:
>>
>>> Hi Sambit,
>>> I looked into the quorum server using the command "clquorumserver show +"; it
>>> showed me the old services that the quorum server had served before this setup,
>>> so I reinstalled my host and set up the quorum server again.
>>>
>> There are cases where the quorum server could still be maintaining
>> information about the clusters it serviced in the past.
>> (It can happen as a result of unclean removal of the quorum server, for
>> instance.)
>>
>> This should not affect its operation with the current clusters that it is
>> serving.
>>
>> There is a procedure to clean up such stale information anyhow - you do not
>> need to reinstall the quorum server for that.
>> Please look in the section "How to Clean Up the Quorum Server Configuration
>> Information" in the quorum server document
>> (http://docs.sun.com/app/docs/doc/820-4679/gfjrh?l=en&a=view).
>>
>>> From the document that I got from
>>> http://opensolaris.org/os/community/ha-clusters/ohac/Documentation/OHACdocs/,
>>> and from the restrictions, Veritas Volume Manager is not supported in
>>> OpenSolaris HA Cluster. So I wonder whether we need to use metaset to set up
>>> a diskset to manage the disks, or whether what I have from the doc is enough?
>>>
> If you want to create failover disksets and volumes from shared disks,
> then yes, you will need to use the meta* commands to create a multi-host
> diskset. Otherwise, i.e. if you don't need disksets/volumes, you can use
> the shared devices as they are, using the DID path.
>
> cheers
> Binu
>
>> I'll let other folks answer this one.
>>
>> Thanks & Regards,
>> Sambit
>>
>>> Thanks,
>>>
>>> Janey
>>> -----Original Message-----
>>> From: Sambit.Nayak at Sun.COM [mailto:Sambit.Nayak at Sun.COM]
>>> Sent: Wednesday, September 09, 2009 1:53 AM
>>> To: Le, Janey
>>> Cc: ha-clusters-discuss at opensolaris.org
>>> Subject: Re: [ha-clusters-discuss] Host panic - OpenSolaris SunCluster
>>>
>>> Hi Janey,
>>>
>>> The error message:
>>> > WARNING: CMM: Reading reservation keys from quorum device Auron
>>> > failed with error 2.
>>> means that the node xCid failed to read the quorum keys from the quorum
>>> server host Auron.
>>>
>>> It is possible that the node xCid could not contact the quorum server
>>> host Auron at that specific time, when the reconfiguration was in
>>> progress due to the reboot of xCloud.
>>> Also, please look in the syslog messages on xCid for any failure
>>> messages related to the quorum server.
>>>
>>> Is the problem happening every time you reboot xCloud after both nodes
>>> are successfully online in the cluster?
>>>
>>> Things that can be done to debug
>>> --------------------------------
>>> More information about this failure will be available in the cluster
>>> kernel trace buffers.
>>>
>>> If you can obtain the kernel dumps of the cluster nodes at the time of
>>> this problem, then we can look into them to debug the problem further.
>>> If you are not able to provide the dumps, then please run:
>>> > *cmm_dbg_buf/s
>>> > *(cmm_dbg_buf+8)+1/s
>>> at the kmdb prompt resulting from the panic (or on the saved crash
>>> dump), and provide that output.
>>>
>>> Please also save the /var/adm/messages of both nodes.
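If the cluster nodes are set up to save crash dumps, roughly the same CMM debug buffers can also be read from the saved dump afterwards with mdb. A minimal sketch, assuming savecore wrote unix.0/vmcore.0 into the default /var/crash/<hostname> directory (the dump file names and path are assumptions, not taken from this thread):

  # on the node that panicked, once it is back up
  cd /var/crash/`uname -n`
  echo '*cmm_dbg_buf/s' | mdb unix.0 vmcore.0
  echo '*(cmm_dbg_buf+8)+1/s' | mdb unix.0 vmcore.0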
>>>
>>> Each quorum server daemon (on a quorum server host) has an associated
>>> directory where it stores some files.
>>> By default, /var/scqsd is the directory used by a quorum server daemon.
>>> If you have changed the default directory while configuring the quorum
>>> server, then please look there instead.
>>> There will be files named ".scqsd_dbg_buf*" in that directory.
>>> Please provide those files as well; they will tell us what's happening
>>> on the quorum server host Auron.
>>>
>>> If you execute the "clquorum status" command on a cluster node, it will
>>> tell whether the local node can access the quorum server at the time the
>>> command is executed. If access is possible and the node's keys are
>>> present, the quorum server is marked online; else it is marked offline.
>>> So if you execute this command on both cluster nodes before doing the
>>> experiment of rebooting xCloud, that will tell whether either node was
>>> having problems accessing the quorum server.
>>> Please run that command on both nodes, and capture the output, before
>>> rebooting xCloud.
>>>
>>> Similarly, "clquorumserver show +" on the quorum server host will
>>> tell what cluster it is serving, what keys are present on the quorum
>>> server, which cluster node is the owner of the quorum server, etc.
>>> Please capture its output before rebooting xCloud, and again after xCid
>>> panics as a result of rebooting xCloud.
>>>
>>> ************
>>>
>>> Just as a confirmation, the cluster is running Open HA Cluster 2009.06,
>>> and you are using the quorum server packages available with Open HA
>>> Cluster 2009.06, right?
>>>
>>> Thanks & Regards,
>>> Sambit
>>>
>>> Janey Le wrote:
>>>
>>>> After setting up SunCluster on OpenSolaris, when I reboot the second
>>>> node of the cluster, my first node panics. Can you please let me know if
>>>> there is anyone I can contact to find out whether this is a setup issue
>>>> or a cluster bug?
>>>>
>>>> Below is the setup that I had:
>>>>
>>>> - 2x1 (two OpenSolaris 2009.06 x86 hosts named xCid and xCloud
>>>>   connected to one FC array)
>>>> - Created 32 volumes and mapped them to the host group; under the host
>>>>   group are the two cluster nodes
>>>> - Formatted the volumes
>>>> - Set up the cluster with a quorum server named Auron (both nodes joined
>>>>   the cluster, and all of the resource groups and resources are online
>>>>   on the 1st node, xCid)
>>>>
>>>> Below is the status of the cluster before rebooting the nodes.
>>>> root at xCid:~# scstat -p
>>>> ------------------------------------------------------------------
>>>>
>>>> -- Cluster Nodes --
>>>>
>>>>                    Node name      Status
>>>>                    ---------      ------
>>>>   Cluster node:    xCid           Online
>>>>   Cluster node:    xCloud         Online
>>>>
>>>> ------------------------------------------------------------------
>>>>
>>>> -- Cluster Transport Paths --
>>>>
>>>>                      Endpoint         Endpoint         Status
>>>>                      --------         --------         ------
>>>>   Transport path:    xCid:e1000g3     xCloud:e1000g3   Path online
>>>>   Transport path:    xCid:e1000g2     xCloud:e1000g2   Path online
>>>>
>>>> ------------------------------------------------------------------
>>>>
>>>> -- Quorum Summary from latest node reconfiguration --
>>>>
>>>>   Quorum votes possible:   3
>>>>   Quorum votes needed:     2
>>>>   Quorum votes present:    3
>>>>
>>>>
>>>> -- Quorum Votes by Node (current status) --
>>>>
>>>>                    Node Name      Present  Possible  Status
>>>>                    ---------      -------  --------  ------
>>>>   Node votes:      xCid           1        1         Online
>>>>   Node votes:      xCloud         1        1         Online
>>>>
>>>>
>>>> -- Quorum Votes by Device (current status) --
>>>>
>>>>                    Device Name    Present  Possible  Status
>>>>                    -----------    -------  --------  ------
>>>>   Device votes:    Auron          1        1         Online
>>>>
>>>> ------------------------------------------------------------------
>>>>
>>>> -- Device Group Servers --
>>>>
>>>>                    Device Group   Primary   Secondary
>>>>                    ------------   -------   ---------
>>>>
>>>>
>>>> -- Device Group Status --
>>>>
>>>>                    Device Group   Status
>>>>                    ------------   ------
>>>>
>>>>
>>>> -- Multi-owner Device Groups --
>>>>
>>>>                    Device Group   Online Status
>>>>                    ------------   -------------
>>>>
>>>> ------------------------------------------------------------------
>>>>
>>>> -- Resource Groups and Resources --
>>>>
>>>>               Group Name   Resources
>>>>               ----------   ---------
>>>>   Resources:  xCloud-rg    xCloud-nfsres r-nfs
>>>>   Resources:  nfs-rg       nfs-lh-rs nfs-hastp-rs nfs-rs
>>>>
>>>>
>>>> -- Resource Groups --
>>>>
>>>>           Group Name   Node Name   State     Suspended
>>>>           ----------   ---------   -----     ---------
>>>>   Group:  xCloud-rg    xCid        Online    No
>>>>   Group:  xCloud-rg    xCloud      Offline   No
>>>>
>>>>   Group:  nfs-rg       xCid        Online    No
>>>>   Group:  nfs-rg       xCloud      Offline   No
>>>>
>>>>
>>>> -- Resources --
>>>>
>>>>              Resource Name   Node Name   State     Status Message
>>>>              -------------   ---------   -----     --------------
>>>>   Resource:  xCloud-nfsres   xCid        Online    Online - LogicalHostname online.
>>>>   Resource:  xCloud-nfsres   xCloud      Offline   Offline
>>>>
>>>>   Resource:  r-nfs           xCid        Online    Online - Service is online.
>>>>   Resource:  r-nfs           xCloud      Offline   Offline
>>>>
>>>>   Resource:  nfs-lh-rs       xCid        Online    Online - LogicalHostname online.
>>>>   Resource:  nfs-lh-rs       xCloud      Offline   Offline
>>>>
>>>>   Resource:  nfs-hastp-rs    xCid        Online    Online
>>>>   Resource:  nfs-hastp-rs    xCloud      Offline   Offline
>>>>
>>>>   Resource:  nfs-rs          xCid        Online    Online - Service is online.
>>>>   Resource:  nfs-rs          xCloud      Offline   Offline
>>>>
>>>> ------------------------------------------------------------------
>>>>
>>>> -- IPMP Groups --
>>>>
>>>>                  Node Name   Group      Status   Adapter   Status
>>>>                  ---------   -----      ------   -------   ------
>>>>   IPMP Group:    xCid        sc_ipmp0   Online   e1000g1   Online
>>>>
>>>>   IPMP Group:    xCloud      sc_ipmp0   Online   e1000g0   Online
>>>>
>>>>
>>>> -- IPMP Groups in Zones --
>>>>
>>>>                  Zone Name   Group      Status   Adapter   Status
>>>>                  ---------   -----      ------   -------   ------
>>>> ------------------------------------------------------------------
>>>> root at xCid:~#
>>>>
>>>>
>>>> root at xCid:~# clnode show
>>>>
>>>> === Cluster Nodes ===
>>>>
>>>> Node Name:                  xCid
>>>>   Node ID:                  1
>>>>   Enabled:                  yes
>>>>   privatehostname:          clusternode1-priv
>>>>   reboot_on_path_failure:   disabled
>>>>   globalzoneshares:         1
>>>>   defaultpsetmin:           1
>>>>   quorum_vote:              1
>>>>   quorum_defaultvote:       1
>>>>   quorum_resv_key:          0x4A9B35C600000001
>>>>   Transport Adapter List:   e1000g2, e1000g3
>>>>
>>>> Node Name:                  xCloud
>>>>   Node ID:                  2
>>>>   Enabled:                  yes
>>>>   privatehostname:          clusternode2-priv
>>>>   reboot_on_path_failure:   disabled
>>>>   globalzoneshares:         1
>>>>   defaultpsetmin:           1
>>>>   quorum_vote:              1
>>>>   quorum_defaultvote:       1
>>>>   quorum_resv_key:          0x4A9B35C600000002
>>>>   Transport Adapter List:   e1000g2, e1000g3
>>>>
>>>> root at xCid:~#
>>>>
>>>>
>>>> ****** Reboot the 1st node xCid; all of the resources transfer to the
>>>> 2nd node xCloud and come online on xCloud ************
>>>>
>>>> root at xCloud:~# scstat -p
>>>> ------------------------------------------------------------------
>>>>
>>>> -- Cluster Nodes --
>>>>
>>>>                    Node name      Status
>>>>                    ---------      ------
>>>>   Cluster node:    xCid           Online
>>>>   Cluster node:    xCloud         Online
>>>>
>>>> ------------------------------------------------------------------
>>>>
>>>> -- Cluster Transport Paths --
>>>>
>>>>                      Endpoint         Endpoint         Status
>>>>                      --------         --------         ------
>>>>   Transport path:    xCid:e1000g3     xCloud:e1000g3   Path online
>>>>   Transport path:    xCid:e1000g2     xCloud:e1000g2   Path online
>>>>
>>>> ------------------------------------------------------------------
>>>>
>>>> -- Quorum Summary from latest node reconfiguration --
>>>>
>>>>   Quorum votes possible:   3
>>>>   Quorum votes needed:     2
>>>>   Quorum votes present:    3
>>>>
>>>>
>>>> -- Quorum Votes by Node (current status) --
>>>>
>>>>                    Node Name      Present  Possible  Status
>>>>                    ---------      -------  --------  ------
>>>>   Node votes:      xCid           1        1         Online
>>>>   Node votes:      xCloud         1        1         Online
>>>>
>>>>
>>>> -- Quorum Votes by Device (current status) --
>>>>
>>>>                    Device Name    Present  Possible  Status
>>>>                    -----------    -------  --------  ------
>>>>   Device votes:    Auron          1        1         Online
>>>>
>>>> ------------------------------------------------------------------
>>>>
>>>> -- Device Group Servers --
>>>>
>>>>                    Device Group   Primary   Secondary
>>>>                    ------------   -------   ---------
>>>>
>>>>
>>>> -- Device Group Status --
>>>>
>>>>                    Device Group   Status
>>>>                    ------------   ------
>>>>
>>>>
>>>> -- Multi-owner Device Groups --
>>>>
>>>>                    Device Group   Online Status
>>>>                    ------------   -------------
>>>>
>>>> ------------------------------------------------------------------
>>>>
>>>> -- Resource Groups and Resources --
>>>>
>>>>               Group Name   Resources
>>>>               ----------   ---------
>>>>   Resources:  xCloud-rg    xCloud-nfsres r-nfs
>>>>   Resources:  nfs-rg       nfs-lh-rs nfs-hastp-rs nfs-rs
>>>>
>>>>
>>>> -- Resource Groups --
>>>>
>>>>           Group Name   Node Name   State     Suspended
>>>>           ----------   ---------   -----     ---------
>>>>   Group:  xCloud-rg    xCid        Offline   No
>>>>   Group:  xCloud-rg    xCloud      Online    No
>>>>
>>>>   Group:  nfs-rg       xCid        Offline   No
>>>>   Group:  nfs-rg       xCloud      Online    No
>>>>
>>>>
>>>> -- Resources --
>>>>
>>>>              Resource Name   Node Name   State     Status Message
>>>>              -------------   ---------   -----     --------------
>>>>   Resource:  xCloud-nfsres   xCid        Offline   Offline
>>>>   Resource:  xCloud-nfsres   xCloud      Online    Online - LogicalHostname online.
>>>>
>>>>   Resource:  r-nfs           xCid        Offline   Offline
>>>>   Resource:  r-nfs           xCloud      Online    Online - Service is online.
>>>>
>>>>   Resource:  nfs-lh-rs       xCid        Offline   Offline
>>>>   Resource:  nfs-lh-rs       xCloud      Online    Online - LogicalHostname online.
>>>>
>>>>   Resource:  nfs-hastp-rs    xCid        Offline   Offline
>>>>   Resource:  nfs-hastp-rs    xCloud      Online    Online
>>>>
>>>>   Resource:  nfs-rs          xCid        Offline   Offline
>>>>   Resource:  nfs-rs          xCloud      Online    Online - Service is online.
>>>>
>>>> ------------------------------------------------------------------
>>>>
>>>> -- IPMP Groups --
>>>>
>>>>                  Node Name   Group      Status   Adapter   Status
>>>>                  ---------   -----      ------   -------   ------
>>>>   IPMP Group:    xCid        sc_ipmp0   Online   e1000g1   Online
>>>>
>>>>   IPMP Group:    xCloud      sc_ipmp0   Online   e1000g0   Online
>>>>
>>>>
>>>> -- IPMP Groups in Zones --
>>>>
>>>>                  Zone Name   Group      Status   Adapter   Status
>>>>                  ---------   -----      ------   -------   ------
>>>> ------------------------------------------------------------------
>>>> root at xCloud:~#
>>>>
>>>>
>>>> *********** Wait for about 5 minutes, then reboot the 2nd node xCloud;
>>>> node xCid panics with the error below *********************
>>>>
>>>> root at xCid:~# Notifying cluster that this node is panicking
>>>> WARNING: CMM: Reading reservation keys from quorum device Auron failed with error 2.
>>>>
>>>> panic[cpu0]/thread=ffffff02d0a623c0: CMM: Cluster lost operational quorum; aborting.
>>>>
>>>> ffffff0011976b50 genunix:vcmn_err+2c ()
>>>> ffffff0011976b60 cl_runtime:__1cZsc_syslog_msg_log_no_args6FpviipkcpnR__va_list_element__nZsc_syslog_msg_status_enum__+1f ()
>>>> ffffff0011976c40 cl_runtime:__1cCosNsc_syslog_msgDlog6MiipkcE_nZsc_syslog_msg_status_enum__+8c ()
>>>> ffffff0011976e30 cl_haci:__1cOautomaton_implbAstate_machine_qcheck_state6M_nVcmm_automaton_event_t__+57f ()
>>>> ffffff0011976e70 cl_haci:__1cIcmm_implStransitions_thread6M_v_+b7 ()
>>>> ffffff0011976e80 cl_haci:__1cIcmm_implYtransitions_thread_start6Fpv_v_+9 ()
>>>> ffffff0011976ed0 cl_orb:cllwpwrapper+d7 ()
>>>> ffffff0011976ee0 unix:thread_start+8 ()
>>>>
>>>> syncing file systems... done
>>>> dumping to /dev/zvol/dsk/rpool/dump, offset 65536, content: kernel
>>>> 51% done
>>>>
>>>> The host log is attached.
>>>>
>>>> I have gone through the SunCluster doc on how to set up SunCluster for
>>>> OpenSolaris multiple times, but I don't see any steps that I missed. Can
>>>> you please help to see whether this is a setup issue or a bug?
>>>>
>>>> Thanks,
>>>>
>>>> Janey