On Mon, Mar 8, 2010 at 8:28 AM, Gabriel Campbell <gabcam at vodafone.com.mt> wrote:
> -bash-3.00$ ps -ef | grep sched
>     root     0     0  0 09:45:46 ?      4193:58 sched
>     root  1361     1  0 09:48:30 ?         0:00 zsched
>     root  1350     1  0 09:48:30 ?         0:00 zsched
>     apps 13216 28638  0 10:39:01 pts/3      0:00 grep sched

I'm not familiar with all the sources from which that CPU time might be
calculated. The mdb output you sent looks normal and it doesn't seem to be
responsible for this huge CPU time. Perhaps someone more experienced with
the kernel internals can jump in here.

> Some reboots starting today:
>
> Mar 7 22:37:26 mt-so-clt01-glb02 genunix: [ID 665015 kern.notice] NOTICE:
> Scalable service instance [TCP,192.168.114.7,11014] registered on node
> mt-so-clt01-glb02.
> Mar 8 02:47:17 mt-so-clt01-glb02 cl_dlpitrans: [ID 624622 kern.notice]
> Notifying cluster that this node is panicking
> Mar 8 02:47:17 mt-so-clt01-glb02 unix: [ID 836849 kern.notice]
> Mar 8 02:47:17 mt-so-clt01-glb02 ^Mpanic[cpu2]/thread=fffffe800043dc80:
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 898738 kern.notice] Aborting
> node because pm_tick delay of 226518361 ms exceeds 5050 ms

The comment right above the code that prints this message says:

http://src.opensolaris.org/source/xref/colorado/colorado/usr/src/common/cl/orb/transport/path_manager.cc

    // The system is unable to send heartbeats for a long
    // time. (This is half of the minimum of timeout
    // values of all the paths. If the timeout values for
    // all the paths is 10 secs then this value is 5
    // secs.) There is probably heavy interrupt activity
    // causing the clock thread to get delayed, which in
    // turn causes irregular heartbeats. The node is
    // aborted because it is considered to be in 'sick'
    // condition and it is better to abort this node
    // instead of causing other nodes (or the cluster) to
    // go down.
    // @user_action
    // Check to see what is causing high interrupt
    // activity and configure the system accordingly.
    //

Has anything changed in your VMware ESX installation that could cause
interrupts to increase or the clock to drift? I'm not familiar with ESX
but, judging from the messages I see on the Linux mailing lists, people
have had lots of problems with time synchronization. Perhaps the VMware
KnowledgeBase website can help you figure out whether any special
configuration is needed.

The "problem" is that the other 3 machines are working fine.
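If it helps, here is a rough sketch of what I'd run on Solaris 10 to get a
picture of the interrupt load that comment is talking about (the intervals
are arbitrary and the devices listed will of course differ on your box):

    # intrstat 5 3                    (per-device interrupt rates, per CPU)
    # mpstat 5 3                      (intr/ithr columns show interrupt load)
    # echo "::interrupts" | mdb -k    (which vectors are bound to which CPUs)
    # ntpq -p                         (if NTP is running, whether the clock is drifting)

If one device (e1000g, say) turns out to be taking a huge number of
interrupts per second, that would fit the @user_action comment above.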
This may help: http://72.5.124.102/thread.jspa?threadID=5246011&messageID=10016682

> Mar 8 02:47:17 mt-so-clt01-glb02 unix: [ID 100000 kern.notice]
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043d810 genunix:vcmn_err+13 ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043d820 cl_runtime:__1cZsc_syslog_msg_log_no_args6FpviipkcpnR__va_list_element__nZsc_syslog_msg_status_enum__+24 ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043d900 cl_runtime:__1cCosNsc_syslog_msgDlog6MiipkcE_nZsc_syslog_msg_status_enum__+9d ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043d990 cl_comm:__1cMpath_managerHpm_tick6Mn0APcyclic_caller_t__v_+17b ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043d9a0 cl_comm:__1cbDpath_manager_cyclic_interface6FnMpath_managerPcyclic_caller_t__v_+14 ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043d9d0 cl_comm:__1cNhb_threadpoolOsend_heartbeat6M_v_+3a ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043d9e0 cl_comm:hb_threadpool_send_heartbeat_wrapper+12 ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043da00 clhbsndr:hbsndr_rput+1f ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043da60 unix:putnext+1f1 ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043da90 dld:dld_str_rx_fastpath+24 ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043db50 dls:i_dls_link_rx+18c ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043dba0 mac:mac_rx+71 ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043dbf0 e1000g:e1000g_intr_work+d9 ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043dc10 e1000g:e1000g_intr+5b ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043dc60 unix:av_dispatch_autovect+78 ()
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 655072 kern.notice]
> fffffe800043dc70 unix:intr_thread+5f ()
> Mar 8 02:47:17 mt-so-clt01-glb02 unix: [ID 100000 kern.notice]
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 672855 kern.notice] syncing
> file systems...
> Mar 8 02:47:17 mt-so-clt01-glb02 genunix: [ID 733762 kern.notice] 11
> Mar 8 02:47:18 mt-so-clt01-glb02 genunix: [ID 733762 kern.notice] 8
> Mar 8 02:47:19 mt-so-clt01-glb02 genunix: [ID 733762 kern.notice] 7
> Mar 8 02:47:42 mt-so-clt01-glb02 last message repeated 20 times
> Mar 8 02:47:43 mt-so-clt01-glb02 genunix: [ID 622722 kern.notice] done
> (not all i/o completed)
> Mar 8 02:47:44 mt-so-clt01-glb02 genunix: [ID 111219 kern.notice] dumping
> to /dev/dsk/c1t0d0s1, offset 838991872, content: kernel

It might be useful to open a support case at Oracle/Sun and send the
contents of this dump.

--
Giovanni Tirloni
sysdroid.com
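P.S. If you want a first look at the dump yourself before opening the case,
mdb can read it once savecore has extracted the crash files. A rough sketch,
assuming the default /var/crash/<hostname> directory and dump number 0:

    # cd /var/crash/`uname -n`
    # mdb unix.0 vmcore.0
    > ::status      (panic string -- should show the pm_tick message)
    > ::msgbuf      (console messages leading up to the panic)
    > $C            (stack of the panicking thread)
    > $q

That won't replace the support case, but it usually confirms whether the
saved dump matches the panic you see in /var/adm/messages.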