[ha-clusters-discuss] Unexpected panic on single node cluster

Martin Rattner Thu, 04 Feb 2010 13:43:33 -0800

Stacy,

Failfast is enabled on pmfd, so that if the daemon dies, it will panic 
the node.  This makes sense in a multi-node cluster, but not so much on 
a single-node cluster.  You can temporarily disable failfast on your 
single node until the next reboot, by running the following command:


    /usr/cluster/lib/sc/cmm_ctl -f

pmfd should have dropped a core file when it died.  Run 'coreadm' to 
see where core files are written on your system.  For further diagnosis, 
I would like to see the output of 'pstack' run against that core file.
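
As a rough sketch of those two steps: 'coreadm' with no arguments prints 
the current core file name patterns, and 'pstack' accepts a core file as 
its argument.  The helper name show_core_stack and the example path are 
made up for illustration; substitute whatever location coreadm reports.

```shell
#!/bin/sh
# show_core_stack (hypothetical helper): run pstack on a core file,
# but only after checking that the file is actually readable.
show_core_stack() {
    core=$1
    if [ ! -r "$core" ]; then
        echo "core file not readable: $core" >&2
        return 1
    fi
    # pstack prints a stack trace for each thread recorded in the core.
    pstack "$core"
}

# Typical use on the cluster node (the path is an assumption, not taken
# from the original report):
#   coreadm                              # see where core files are written
#   show_core_stack /var/core/core.pmfd  # then inspect the pmfd core
```

If the function reports that the file is not readable, re-check the 
patterns printed by coreadm before assuming that no core was written.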

--Marty

On 02/04/10 01:16 PM, Hartmut Streppel wrote:
> Hi Stacy,
> the log file shows that prior to pmfd disappearing, your acsls-rg had 
> severe problems. Although it is not directly obvious that this has 
> caused pmfd to die, I would first diagnose the RG problem. What is the 
> Failover_mode property of acsls-rg set to?
>
> Regards Hartmut
>
> Stacy Maydew wrote:
>> Hi all,
>>
>> Running OpenSolaris 2009.06 and OHAC 2009.06 on an x64 machine.
>>
>> We're trying to set up and test a single-node cluster.  During tests 
>> that bring the services under cluster control online and offline, the 
>> system occasionally panics unexpectedly.  Any insights would be 
>> greatly appreciated.
>>
>> The following error message is generated:
>>
>> 656416 libsecurity, door_call: Fatal, the server is not available.
>>
>> *Description:*
>>
>> The client (libpmf/libfe/libscha) is trying to communicate with the 
>> server (rpc.pmfd/rpc.fed/rgmd) but is failing because the server 
>> might be down.
>>
>> *Solution:*
>>
>> Save the /var/adm/messages files on each node. Contact your 
>> authorized Sun service provider to determine whether a workaround or 
>> patch is available.
>>
>> ----------------------------------------------------------------------------------------------
>>  
>>
>> The following error messages appear at the time of the panic in 
>> /var/adm/messages:
>>
>> Feb  2 10:54:07 vdev30ga Cluster.RGM.global.rgmd: [ID 424774 
>> daemon.error] Resource group <acsls-rg> requires operator attention 
>> due to STOP failure
>> Feb  2 10:54:30 vdev30ga unix: [ID 836849 kern.notice]
>> Feb  2 10:54:30 vdev30ga ^Mpanic[cpu0]/thread=ffffff001e8bfc60:
>> Feb  2 10:54:30 vdev30ga genunix: [ID 562397 kern.notice] Failfast: 
>> Aborting zone "global" (zone ID 0) because "pmfd" died 35 seconds ago.
>> Feb  2 10:54:30 vdev30ga unix: [ID 100000 kern.notice]
>> Feb  2 10:54:30 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bf8c0 genunix:vcmn_err+2c ()
>> Feb  2 10:54:30 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bf8d0 
>> cl_runtime:__1cZsc_syslog_msg_log_no_args6FpviipkcpnR__va_list_element__nZsc_syslog_msg_status_enum__+1f
>>  
>> ()
>> Feb  2 10:54:30 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bf9b0 
>> cl_runtime:__1cCosNsc_syslog_msgDlog6MiipkcE_nZsc_syslog_msg_status_enum__+8c
>>  
>> ()
>> Feb  2 10:54:30 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bf9e0 cl_haci:__1cHff_implPstop_node_panic6M_v_+3b4 ()
>> Feb  2 10:54:30 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bfa00 cl_haci:__1cHff_implNunit_timedout6M_v_+53 ()
>> Feb  2 10:54:30 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bfa20 cl_haci:__1cLff_timedout6Fpc_v_+11 ()
>> Feb  2 10:54:30 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bfa70 
>> cl_haci:__1cQff_callout_tableTper_tick_processing6F_v_+c7 ()
>> Feb  2 10:54:30 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bfaa0 
>> cl_haci:__1cNff_admin_implWsc_per_tick_processing6Mn0AQcallout_caller_t__v_+83
>>  
>> ()
>> Feb  2 10:54:30 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bfab0 cl_haci:__1cNff_admin_implQff_clock_callout6F_v_+12 ()
>> Feb  2 10:54:30 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bfb10 genunix:clock+346 ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bfbc0 genunix:cyclic_softint+dc ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bfbd0 unix:cbe_softclock+1a ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bfc10 unix:av_dispatch_softvect+5f ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e8bfc40 unix:dispatch_softint+34 ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e805a60 unix:switch_sp_and_call+13 ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e805a90 unix:dosoftint+59 ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e805ae0 unix:do_interrupt+fc ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e805af0 unix:cmnint+ba ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e805be0 unix:mach_cpu_idle+b ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e805c10 unix:cpu_idle+c0 ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e805c20 unix:cpu_idle_adaptive+19 ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e805c40 unix:idle+114 ()
>> Feb  2 10:54:31 vdev30ga genunix: [ID 655072 kern.notice] 
>> ffffff001e805c50 unix:thread_start+8 ()
>> Feb  2 10:54:31 vdev30ga unix: [ID 100000 kern.notice]
>> Feb  2 10:54:31 vdev30ga genunix: [ID 672855 kern.notice] syncing 
>> file systems...
>> Feb  2 10:54:31 vdev30ga genunix: [ID 904073 kern.notice]  done
>>   
>
