Until you identify what specifically spiked, you don't know what to prevent.
Do you have a TAC case open yet?

-Ryan

On Jan 28, 2016, at 7:42 PM, Erick Bergquist <erick...@gmail.com> wrote:

Thanks Ryan. Nothing really standing out on the ESXi side. Any suggestions to
prevent this in the future? The VMware KB articles suggest a higher timeout on
the application side so transient issues and conditions on the VM side of
things don't negatively impact the application.

On Friday, January 22, 2016, Ryan Ratliff (rratliff) <rratl...@cisco.com> wrote:

That's your culprit right there. That 16-second sdlrouter thread timer is one
of the few things I bet hasn't changed since 4.x. Can you look at ESXi stats
for that time period to see what the host thought was going on?

-Ryan

> On Jan 22, 2016, at 4:45 PM, Erick Bergquist <erick...@gmail.com> wrote:
>
> Normal CPU usage up to the core, then a CPU spike right after the core
> when things are starting back up. Different hosts, same storage.
>
> The only gap in the trace file line entries is 23:54:59.957 to
> 23:55:18.402, but maybe there was just no activity?
>
> 11574290.000 |23:54:59.957 |SdlStat |Period: 6s #Lines: 19 #Bytes: 3701 Total Number of Buffers: 10000 Free LWM: 9989 Free LWM(total): 9192
> 11574291.000 |23:55:18.402 |AppInfo |SIPSocketProtocol(2,100,14,137)::handleReadComplete send SdlReadRsp: size 370
> 11574292.000 |23:55:18.402 |SdlStat |Period: 19s #Lines: 1 #Bytes: 117 Total Number of Buffers: 10000 Free LWM: 9998 Free LWM(total): 9192
> 11574293.000 |23:55:18.959 |AppInfo |CMProcMon - ERROR !!!!!!!! SDL Router Services Dead: timeBeforeSleep: 804509608, currentMontiorTime: 804530608, deltaTime: 21000, LoopCount= 19
>
> The deltaTime is 21000, which would be more than the 20 seconds for the
> router thread in the previously mentioned document covering 3.x/4.x.
>
> Do these services still shut down and restart themselves in 10.x if over
> 16 seconds (SDL Timer) or 20 seconds (Router Thread), at the same
> thresholds that document states?
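The arithmetic in that CMProcMon line can be reproduced outside CUCM. This is an illustrative Python sketch only, not anything the product runs; the 20000 ms figure is the router-thread limit from the 3.x/4.x document discussed in this thread:

```python
import re

# CMProcMon line from the SDL trace quoted above (verbatim, including
# the product's "currentMontiorTime" spelling).
line = ("CMProcMon - ERROR !!!!!!!! SDL Router Services Dead: "
        "timeBeforeSleep: 804509608, currentMontiorTime: 804530608, "
        "deltaTime: 21000, LoopCount= 19")

m = re.search(r"timeBeforeSleep: (\d+), currentMontiorTime: (\d+), "
              r"deltaTime: (\d+)", line)
before_ms, now_ms, delta_ms = (int(g) for g in m.groups())

ROUTER_THREAD_LIMIT_MS = 20 * 1000  # 20 s, per the old crashes-and-shutdowns doc

print(now_ms - before_ms == delta_ms)     # the monitor's own arithmetic checks out
print(delta_ms > ROUTER_THREAD_LIMIT_MS)  # 21000 ms is past the 20 s limit
```

So the reported deltaTime is internally consistent and sits just past the documented router-thread threshold, which fits the "SDL Router Services Dead" verdict.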
>
> The VMware KB suggests the SCSI timeout should be increased to handle
> transient conditions. I would think these timers should be able to
> handle transient issues without failing the service? Maybe the
> thresholds need to be increased to handle these periods.
>
> http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465
>
> I've only seen this with the ccm and ctimanager services. None of the
> other services.
>
> Thanks.
>
> On Fri, Jan 22, 2016 at 2:44 PM, Ryan Ratliff (rratliff) <rratl...@cisco.com> wrote:
>> How about CPU usage leading up to the cores?
>> If you have CCM traces from before, are there gaps in the timestamps
>> between line numbers?
>>
>> If you are virtualized, were those two nodes on the same host? If so,
>> did other VMs on that host experience any odd issues at the time?
>> -Ryan
>>
>> On Jan 22, 2016, at 3:29 PM, Erick Bergquist <erick...@gmail.com> wrote:
>>
>> Ryan,
>>
>> I'm seeing a small spike in IOWait, but not for that long. No gaps in
>> the Perfmon logs. Only the ccm service core dumped, on both of the call
>> manager nodes on this cluster at the same time.
>>
>> Would the system shut down the CCM and CTI Manager services itself if
>> these timers run too long, like the document from the CCM 3.x/4.x days
>> alludes to with the 16-second SDL Timer Thread and 20-second Router
>> Thread thresholds?
>>
>> The subscriber had a different backtrace: Intentional Abort.
>>
>> ====================================
>> backtrace - CUCM
>> ====================================
>> #0 0xf776c430 in __kernel_vsyscall ()
>> #1 0xf697b871 in raise () from /lib/libc.so.6
>> #2 0xf697d14a in abort () from /lib/libc.so.6
>> #3 0x0837609e in IntentionalAbort () at ProcessCMProcMon.cpp:88
>> #4 CMProcMon::verifySdlRouterServices () at ProcessCMProcMon.cpp:748
>> #5 0x083762ca in CMProcMon::callManagerMonitorThread (cmProcMon=0xefc73038) at ProcessCMProcMon.cpp:429
>> #6 0xf6c9c5a8 in ACE_OS_Thread_Adapter::invoke (this=0xe6a5d818) at OS_Thread_Adapter.cpp:103
>> #7 0xf6c5c5f1 in ace_thread_adapter (args=0xe6a5d818) at Base_Thread_Adapter.cpp:126
>> #8 0xf6932b39 in start_thread () from /lib/libpthread.so.0
>> #9 0xf6a33c1e in clone () from /lib/libc.so.6
>> ====================================
>> info registers
>> ====================================
>>
>> Thanks,
>> Erick
>>
>> On Fri, Jan 22, 2016 at 1:23 PM, Ryan Ratliff (rratliff) <rratl...@cisco.com> wrote:
>>> That backtrace is usually accompanied by CTIManager crashing due to an
>>> intentional abort.
>>>
>>>> 09465138.001 |23:55:15.448 |AppInfo |CMProcMon::star_sdlVerification - Test Timer exceeded minimum timer latency threshold of 1000 milliseconds, Actual latency: 54360 milliseconds
>>>
>>> That is quite spectacularly bad. The SDLTimer thread is responsible for
>>> _every_ timer the ccm process uses. It has a one-second self-check to
>>> make sure it's getting enough CPU time, and in your case it didn't get
>>> to execute its one-second check for 54 seconds.
>>>
>>> This was definitely performance related, so check CPU usage and IOWait
>>> at the time (+/- a minute or so). Gaps in the RISDC perfmon logs would
>>> also indicate performance issues, though they leave you with little
>>> evidence unless they happened to catch some process at the beginning of
>>> a spike.
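Mechanically, the self-check Ryan describes amounts to a watchdog: arm a short timer, measure how late it actually fires, and intentionally abort if the latency blows past a threshold. A rough Python sketch of that pattern; this is purely illustrative (the real check lives in the ccm process's C++ CMProcMon code, and all names here are invented):

```python
import os
import threading
import time

def timer_latency_ms(armed_at, interval_s, now):
    """How late a timer armed at `armed_at` for `interval_s` fired, in ms."""
    return max(0.0, (now - armed_at - interval_s) * 1000.0)

def watchdog(stop_event, interval_s=1.0, max_latency_ms=1000.0,
             abort=os.abort, clock=time.monotonic):
    """Arm a short timer in a loop and verify it fires roughly on time.

    If this thread is starved of CPU (or the host stalls the whole VM),
    the measured latency explodes, and the process aborts on purpose so
    it can be restarted in a known-good state.
    """
    while not stop_event.is_set():
        armed_at = clock()
        time.sleep(interval_s)  # stand-in for the one-second test timer
        if timer_latency_ms(armed_at, interval_s, clock()) > max_latency_ms:
            abort()  # cf. "Test Timer exceeded minimum timer latency threshold"
```

With a 1000 ms threshold, the 54360 ms latency in the trace above is over the limit more than fiftyfold, which is why the resulting abort is intentional rather than a conventional crash.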
>>>
>>> -Ryan
>>>
>>> On Jan 22, 2016, at 1:13 PM, Erick Bergquist <erick...@gmail.com> wrote:
>>>
>>> Brian,
>>>
>>> Looked over those and not seeing any processes consuming lots of
>>> memory; memory used was 70-75% during this time.
>>>
>>>> On Fri, Jan 22, 2016 at 10:54 AM, Brian Meade <bmead...@vt.edu> wrote:
>>>> Did you download the Perfmon logs? Might just be high memory in
>>>> general causing the issue. I'd use the Perfmon logs to check what
>>>> processes were using up all the memory at that time.
>>>>
>>>> On Fri, Jan 22, 2016 at 10:45 AM, Erick Bergquist <erick...@gmail.com> wrote:
>>>>>
>>>>> This happened on CUCM 10.5.2.10000-7, and I'm not finding a bug for
>>>>> this backtrace info in the bug toolkit or the 10.5.2 SU1, SU2, SU3
>>>>> release notes.
>>>>>
>>>>> From the core analysis:
>>>>>
>>>>> Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...done.
>>>>> Loaded symbols for /lib/libnss_files.so.2
>>>>> Reading symbols from /lib/libnss_dns.so.2...(no debugging symbols found)...done.
>>>>> Loaded symbols for /lib/libnss_dns.so.2
>>>>> Core was generated by `/usr/local/cm/bin/ccm'.
>>>>> Program terminated with signal 11, Segmentation fault.
>>>>> #0 0xf6a28b55 in _int_malloc () from /lib/libc.so.6
>>>>>
>>>>> ====================================
>>>>> backtrace - CUCM
>>>>> ====================================
>>>>> #0 0xf6a28b55 in _int_malloc () from /lib/libc.so.6
>>>>> #1 0xf6a29d9e in malloc () from /lib/libc.so.6
>>>>> #2 0x09be122c in AlternateSyslogAlarmMonitor::manageLogs (this=0xd7d74f8, severity=6, strBuf=<value optimized out>) at AlternateSyslogAlarmMonitor.cpp:162
>>>>> #3 0x09be13f6 in AlternateSyslogAlarmMonitor::AlarmThreadProc (pParameter=0xd7d74f8) at AlternateSyslogAlarmMonitor.cpp:61
>>>>> #4 0xf6d005a8 in ACE_OS_Thread_Adapter::invoke (this=0xd7d8dc8) at OS_Thread_Adapter.cpp:103
>>>>> #5 0xf6cc05f1 in ace_thread_adapter (args=0xd7d8dc8) at Base_Thread_Adapter.cpp:126
>>>>> #6 0xf6996b39 in start_thread () from /lib/libpthread.so.0
>>>>> #7 0xf6a97c1e in clone () from /lib/libc.so.6
>>>>> ====================================
>>>>> info registers
>>>>> ====================================
>>>>> eax    0x198408
>>>>> ecx    0xf6b483d0    -155941936
>>>>> edx    0xf6b483a0    -155941984
>>>>> ebx    0xf6b46ff4    -155947020
>>>>> esp    0xf5d0ab10    0xf5d0ab10
>>>>> ebp    0xf5d0ab98    0xf5d0ab98
>>>>> esi    0xf4bc0b8     256622776
>>>>> edi    0x0           0
>>>>> eip    0xf6a28b55    0xf6a28b55 <_int_malloc+725>
>>>>> eflags 0x10212       [ AF IF RF ]
>>>>> cs     0x23          35
>>>>> ss     0x2b          43
>>>>> ds     0x2b          43
>>>>> es     0x2b          43
>>>>> fs     0x0           0
>>>>> gs     0x63          99
>>>>> ====================================
>>>>> info threads
>>>>> ====================================
>>>>> 47 Thread 0xe750bb70 (LWP 30608) 0xf6a237f4 in _IO_flush_all_lockp () from /lib/libc.so.6
>>>>> 46 Thread 0xf5effb70 (LWP 29077) 0xf77d0430 in __kernel_vsyscall ()
>>>>> 45 Thread 0xf63ffb70 (LWP 29052) 0xf77d0430 in __kernel_vsyscall ()
>>>>> 44 Thread 0xf67d0b70 (LWP 29048) 0xf77d0430 in __kernel_vsyscall ()
>>>>> 43 Thread 0xf67e0710 (LWP 25984) 0xf77d0430 in __kernel_vsyscall ()
>>>>> 42 Thread 0xe6f2fb70 (LWP 30611) 0xf77d0430 in __kernel_vsyscall ()
>>>>>
>>>>> No errors, etc., in the syslog at the time other than errors about
>>>>> the Call Manager service failing.
>>>>>
>>>>> However, the SDL CCM traces show a timer expiring, and it appears the
>>>>> services shut down due to the timer issue.
>>>>>
>>>>> From the following Cisco doc,
>>>>> http://www.cisco.com/c/en/us/support/docs/voice-unified-communications/unified-communications-manager-callmanager/46806-cm-crashes-and-shutdowns.html
>>>>> it looks like the SDL timer thread stopped and the service shut down?
>>>>> This doc is from the CCM 3.x and 4.x time frame. Are the timers
>>>>> mentioned in this doc still true for 10.5, where it shuts down at 16
>>>>> seconds for the SDL timer thread or 20 seconds for the SDL router
>>>>> thread?
>>>>>
>>>>> 09465137.000 |23:55:15.416 |Stopping | | |SdlTCPConnection(1,100,14,37) |SdlTCPConnection(1,100,14,37) | |NumOfCurrentInstances: 39
>>>>>
>>>>> 09465138.000 |23:55:15.448 |SdlSig |SdlVerificationTimer |monitoring |CMProcMon(1,100,234,1) |SdlTimerService(1,100,3,1) |1,100,234,1.3139616^*^* |*TraceFlagOverrode
>>>>>
>>>>> 09465138.001 |23:55:15.448 |AppInfo |CMProcMon::star_sdlVerification - Test Timer exceeded minimum timer latency threshold of 1000 milliseconds, Actual latency: 54360 milliseconds
>>>>>
>>>>> 09465139.000 |23:55:16.135 |AppInfo |CallManagerFailure - Indicates an internal failure in Unified CM Host Name of Hosting Node:CUCM1 Host Node IP Address:192.168.1.5 Reason code:4 Additional Text: App ID:Cisco CallManager Cluster ID:StandAloneCluster Node ID:CUCM1
>>>>>
>>>>> 09465140.000 |23:55:16.135 |AlarmErr |AlarmClass: CallManager, AlarmName: CallManagerFailure, AlarmSeverity: Critical, AlarmMessage: , AlarmDescription: Indicates an internal failure in Unified CM, AlarmParameters: HostName:CUCM1, IPAddress:192.168.1.5, Reason:4, Text:, AppID:Cisco CallManager, ClusterID:StandAloneCluster, NodeID:CUCM1
>>>>>
>>>>> 09465141.000 |23:55:17.465 |AppInfo |CallManagerFailure - Indicates an internal failure in Unified CM Host Name of Hosting Node:CUCM1 Host Node IP Address:192.168.1.5 Reason code:4 Additional Text:CCM Intentional Abort: SignalName: SdlVerificationTimer, DestPID: CMProcMon[1:100:234:1] App ID:Cisco CallManager Cluster ID:StandAloneCluster Node ID:CUCM1
>>>>>
>>>>> 09465142.000 |23:55:17.465 |AlarmErr |AlarmClass: CallManager, AlarmName: CallManagerFailure, AlarmSeverity: Critical, AlarmMessage: , AlarmDescription: Indicates an internal failure in Unified CM, AlarmParameters: HostName:CUCM1, IPAddress:192.168.1.5, Reason:4, Text:CCM Intentional Abort: SignalName: SdlVerificationTimer, DestPID: CMProcMon[1:100:234:1], AppID:Cisco CallManager, ClusterID:StandAloneCluster, NodeID:CUCM1,
>>>>>
>>>>> 09465138.002 |23:55:17.566 |AppInfo |TimerThreadSlowed - Timer thread has slowed beyond acceptable limits App ID:Cisco CallManager Cluster ID:StandAloneCluster Node ID:CUCM1
>>>>>
>>>>> 09465138.003 |23:55:17.566 |AlarmErr |AlarmClass: CallManager, AlarmName: TimerThreadSlowed, AlarmSeverity: Critical, AlarmMessage: , AlarmDescription: Timer thread has slowed beyond acceptable limits, AlarmParameters: AppID:Cisco CallManager, ClusterID:StandAloneCluster, NodeID:CUCM1,
>>>>>
>>>>> 09465143.000 |23:55:17.566 |SdlSig |ReapOldTokenRegistrationsTimer |wait |SIPStationInit(1,100,72,1) |SdlTimerService(1,100,3,1) |1,100,146,1.1^*^* |[R:H-H:36,N:55,L:1,V:1,Z:0,D:0]
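Gaps like the 23:54:59.957 to 23:55:18.402 jump Erick found can be located mechanically in an exported SDL trace. A rough Python sketch (my own illustration, not a Cisco tool; it assumes the second pipe-delimited field is an HH:MM:SS.mmm timestamp, as in the excerpts above, and ignores midnight rollover):

```python
from datetime import datetime

def find_gaps(lines, threshold_s=16.0):
    """Yield (prev_time, time, gap_seconds) for consecutive SDL trace lines
    whose timestamps differ by more than threshold_s (default 16 s, the
    SDL timer-thread limit cited in this thread)."""
    prev = None
    for line in lines:
        fields = line.split("|")
        if len(fields) < 2:
            continue
        try:
            ts = datetime.strptime(fields[1].strip(), "%H:%M:%S.%f")
        except ValueError:
            continue  # not a timestamped trace line
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold_s:
                yield prev.time(), ts.time(), gap
        prev = ts

# The two trace lines bracketing the gap quoted earlier in this thread.
trace = [
    "11574290.000 |23:54:59.957 |SdlStat |Period: 6s #Lines: 19",
    "11574291.000 |23:55:18.402 |AppInfo |SIPSocketProtocol...",
]
for start, end, gap in find_gaps(trace):
    print(start, end, round(gap, 3))
```

On the excerpt above this reports a single gap of about 18.4 seconds, consistent with the 19-second SdlStat period logged right after it.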
_______________________________________________
cisco-voip mailing list
cisco-voip@puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-voip