Until you identify what specifically spiked, you don't know what to prevent.
Do you have a TAC case open yet?

-Ryan

On Jan 28, 2016, at 7:42 PM, Erick Bergquist <erick...@gmail.com> wrote:

Thanks Ryan. Nothing really standing out on the ESXi side. Any suggestions to
prevent this in the future? The VMware KB articles suggest a higher timeout on
the application side so transient issues and conditions on the VM side of
things don't negatively impact the application.

On Friday, January 22, 2016, Ryan Ratliff (rratliff) <rratl...@cisco.com> wrote:

That's your culprit right there. That 16-second sdlrouter thread timer is one
of the few things I bet hasn't changed since 4.x. Can you look at ESXi stats
for that time period to see what the host thought was going on?

-Ryan

> On Jan 22, 2016, at 4:45 PM, Erick Bergquist <erick...@gmail.com> wrote:
>
> Normal CPU usage up to the core, then a CPU spike right after the core
> when things are starting back up. Different hosts, same storage.
>
> The only gap in the trace file line entries is 23:54:59.957 to
> 23:55:18.402, but maybe there was just no activity?
>
> 11574290.000 |23:54:59.957 |SdlStat |Period: 6s #Lines: 19 #Bytes: 3701 Total Number of Buffers: 10000 Free LWM: 9989 Free LWM(total): 9192
> 11574291.000 |23:55:18.402 |AppInfo |SIPSocketProtocol(2,100,14,137)::handleReadComplete send SdlReadRsp: size 370
> 11574292.000 |23:55:18.402 |SdlStat |Period: 19s #Lines: 1 #Bytes: 117 Total Number of Buffers: 10000 Free LWM: 9998 Free LWM(total): 9192
> 11574293.000 |23:55:18.959 |AppInfo |CMProcMon - ERROR !!!!!!!! SDL Router Services Dead: timeBeforeSleep: 804509608, currentMontiorTime: 804530608, deltaTime: 21000, LoopCount= 19
>
> The deltaTime is 21000, which would be more than the 20 seconds for the
> router thread in the previously mentioned document covering 3.x/4.x.
>
> Do these services still shut down and restart themselves in 10.x if over
> 16 seconds (SDL Timer) or 20 seconds (Router Thread), at the same
> thresholds that document states?
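The arithmetic in that CMProcMon line can be reproduced outside CUCM. This is an illustrative Python sketch only, not anything the product runs; the 20000 ms figure is the router-thread limit from the 3.x/4.x document discussed in this thread:

```python
import re

# CMProcMon line from the SDL trace quoted above (verbatim, including
# the product's "currentMontiorTime" spelling).
line = ("CMProcMon - ERROR !!!!!!!! SDL Router Services Dead: "
        "timeBeforeSleep: 804509608, currentMontiorTime: 804530608, "
        "deltaTime: 21000, LoopCount= 19")

m = re.search(r"timeBeforeSleep: (\d+), currentMontiorTime: (\d+), "
              r"deltaTime: (\d+)", line)
before_ms, now_ms, delta_ms = (int(g) for g in m.groups())

ROUTER_THREAD_LIMIT_MS = 20 * 1000  # 20 s, per the old crashes-and-shutdowns doc

print(now_ms - before_ms == delta_ms)     # the monitor's own arithmetic checks out
print(delta_ms > ROUTER_THREAD_LIMIT_MS)  # 21000 ms is past the 20 s limit
```

So the reported deltaTime is internally consistent and sits just past the documented router-thread threshold, which fits the "SDL Router Services Dead" verdict.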
>
> The VMware KB suggests the SCSI timeout should be increased to handle
> transient conditions. I would think these timers should be able to
> handle transient issues without failing the service? Maybe the
> thresholds need to be increased to handle these periods.
>
> http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465
>
> I've only seen this with the ccm and ctimanager services. None of the
> other services.
>
> Thanks.
>
> On Fri, Jan 22, 2016 at 2:44 PM, Ryan Ratliff (rratliff) <rratl...@cisco.com> wrote:
>> How about CPU usage leading up to the cores?
>> If you have CCM traces from before, are there gaps in the timestamps
>> between line numbers?
>>
>> If you are virtualized, were those two nodes on the same host? If so,
>> did other VMs on that host experience any odd issues at the time?
>> -Ryan
>>
>> On Jan 22, 2016, at 3:29 PM, Erick Bergquist <erick...@gmail.com> wrote:
>>
>> Ryan,
>>
>> I'm seeing a small spike in IOWait, but not for that long. No gaps in
>> the Perfmon logs. Only the ccm service core dumped, on both of the call
>> manager nodes on this cluster at the same time.
>>
>> Would the system shut down the CCM and CTI Manager services itself if
>> these timers run too long, like the document from the CCM 3.x/4.x days
>> alludes to with the 16-second SDL Timer Thread and 20-second Router
>> Thread thresholds?
>>
>> The subscriber had a different backtrace: Intentional Abort.
>>
>> ====================================
>> backtrace - CUCM
>> ====================================
>> #0 0xf776c430 in __kernel_vsyscall ()
>> #1 0xf697b871 in raise () from /lib/libc.so.6
>> #2 0xf697d14a in abort () from /lib/libc.so.6
>> #3 0x0837609e in IntentionalAbort () at ProcessCMProcMon.cpp:88
>> #4 CMProcMon::verifySdlRouterServices () at ProcessCMProcMon.cpp:748
>> #5 0x083762ca in CMProcMon::callManagerMonitorThread (cmProcMon=0xefc73038) at ProcessCMProcMon.cpp:429
>> #6 0xf6c9c5a8 in ACE_OS_Thread_Adapter::invoke (this=0xe6a5d818) at OS_Thread_Adapter.cpp:103
>> #7 0xf6c5c5f1 in ace_thread_adapter (args=0xe6a5d818) at Base_Thread_Adapter.cpp:126
>> #8 0xf6932b39 in start_thread () from /lib/libpthread.so.0
>> #9 0xf6a33c1e in clone () from /lib/libc.so.6
>> ====================================
>> info registers
>> ====================================
>>
>> Thanks,
>> Erick
>>
>> On Fri, Jan 22, 2016 at 1:23 PM, Ryan Ratliff (rratliff) <rratl...@cisco.com> wrote:
>>> That backtrace is usually accompanied by CTIManager crashing due to an
>>> intentional abort.
>>>
>>>> 09465138.001 |23:55:15.448 |AppInfo |CMProcMon::star_sdlVerification - Test Timer exceeded minimum timer latency threshold of 1000 milliseconds, Actual latency: 54360 milliseconds
>>>
>>> That is quite spectacularly bad. The SDLTimer thread is responsible for
>>> _every_ timer the ccm process uses. It has a one-second self-check to
>>> make sure it's getting enough CPU time, and in your case it didn't get
>>> to execute its one-second check for 54 seconds.
>>>
>>> This was definitely performance related, so check CPU usage and IOWait
>>> at the time (+/- a minute or so). Gaps in the RISDC perfmon logs would
>>> also indicate performance issues, though they leave you with little
>>> evidence unless they happened to catch some process at the beginning of
>>> a spike.
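Mechanically, the self-check Ryan describes amounts to a watchdog: arm a short timer, measure how late it actually fires, and intentionally abort if the latency blows past a threshold. A rough Python sketch of that pattern; this is purely illustrative (the real check lives in the ccm process's C++ CMProcMon code, and all names here are invented):

```python
import os
import threading
import time

def timer_latency_ms(armed_at, interval_s, now):
    """How late a timer armed at `armed_at` for `interval_s` fired, in ms."""
    return max(0.0, (now - armed_at - interval_s) * 1000.0)

def watchdog(stop_event, interval_s=1.0, max_latency_ms=1000.0,
             abort=os.abort, clock=time.monotonic):
    """Arm a short timer in a loop and verify it fires roughly on time.

    If this thread is starved of CPU (or the host stalls the whole VM),
    the measured latency explodes, and the process aborts on purpose so
    it can be restarted in a known-good state.
    """
    while not stop_event.is_set():
        armed_at = clock()
        time.sleep(interval_s)  # stand-in for the one-second test timer
        if timer_latency_ms(armed_at, interval_s, clock()) > max_latency_ms:
            abort()  # cf. "Test Timer exceeded minimum timer latency threshold"
```

With a 1000 ms threshold, the 54360 ms latency in the trace above is over the limit more than fiftyfold, which is why the resulting abort is intentional rather than a conventional crash.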
>>>
>>> -Ryan
>>>
>>> On Jan 22, 2016, at 1:13 PM, Erick Bergquist <erick...@gmail.com> wrote:
>>>
>>> Brian,
>>>
>>> Looked over those and not seeing any processes consuming lots of
>>> memory; memory used was 70-75% during this time.
>>>
>>>> On Fri, Jan 22, 2016 at 10:54 AM, Brian Meade <bmead...@vt.edu> wrote:
>>>> Did you download the Perfmon logs? Might just be high memory in
>>>> general causing the issue. I'd use the Perfmon logs to check what
>>>> processes were using up all the memory at that time.
>>>>
>>>> On Fri, Jan 22, 2016 at 10:45 AM, Erick Bergquist <erick...@gmail.com> wrote:
>>>>>
>>>>> This happened on CUCM 10.5.2.10000-7, and I'm not finding a bug for
>>>>> this backtrace info in the bug toolkit or the 10.5.2 SU1, SU2, SU3
>>>>> release notes.
>>>>>
>>>>> From the core analysis:
>>>>>
>>>>> Reading symbols from /lib/libnss_files.so.2...(no debugging symbols found)...done.
>>>>> Loaded symbols for /lib/libnss_files.so.2
>>>>> Reading symbols from /lib/libnss_dns.so.2...(no debugging symbols found)...done.
>>>>> Loaded symbols for /lib/libnss_dns.so.2
>>>>> Core was generated by `/usr/local/cm/bin/ccm'.
>>>>> Program terminated with signal 11, Segmentation fault.
>>>>> #0 0xf6a28b55 in _int_malloc () from /lib/libc.so.6
>>>>>
>>>>> ====================================
>>>>> backtrace - CUCM
>>>>> ====================================
>>>>> #0 0xf6a28b55 in _int_malloc () from /lib/libc.so.6
>>>>> #1 0xf6a29d9e in malloc () from /lib/libc.so.6
>>>>> #2 0x09be122c in AlternateSyslogAlarmMonitor::manageLogs (this=0xd7d74f8, severity=6, strBuf=<value optimized out>) at AlternateSyslogAlarmMonitor.cpp:162
>>>>> #3 0x09be13f6 in AlternateSyslogAlarmMonitor::AlarmThreadProc (pParameter=0xd7d74f8) at AlternateSyslogAlarmMonitor.cpp:61
>>>>> #4 0xf6d005a8 in ACE_OS_Thread_Adapter::invoke (this=0xd7d8dc8) at OS_Thread_Adapter.cpp:103
>>>>> #5 0xf6cc05f1 in ace_thread_adapter (args=0xd7d8dc8) at Base_Thread_Adapter.cpp:126
>>>>> #6 0xf6996b39 in start_thread () from /lib/libpthread.so.0
>>>>> #7 0xf6a97c1e in clone () from /lib/libc.so.6
>>>>> ====================================
>>>>> info registers
>>>>> ====================================
>>>>> eax    0x198408
>>>>> ecx    0xf6b483d0    -155941936
>>>>> edx    0xf6b483a0    -155941984
>>>>> ebx    0xf6b46ff4    -155947020
>>>>> esp    0xf5d0ab10    0xf5d0ab10
>>>>> ebp    0xf5d0ab98    0xf5d0ab98
>>>>> esi    0xf4bc0b8     256622776
>>>>> edi    0x0           0
>>>>> eip    0xf6a28b55    0xf6a28b55 <_int_malloc+725>
>>>>> eflags 0x10212       [ AF IF RF ]
>>>>> cs     0x23          35
>>>>> ss     0x2b          43
>>>>> ds     0x2b          43
>>>>> es     0x2b          43
>>>>> fs     0x0           0
>>>>> gs     0x63          99
>>>>> ====================================
>>>>> info threads
>>>>> ====================================
>>>>> 47 Thread 0xe750bb70 (LWP 30608) 0xf6a237f4 in _IO_flush_all_lockp () from /lib/libc.so.6
>>>>> 46 Thread 0xf5effb70 (LWP 29077) 0xf77d0430 in __kernel_vsyscall ()
>>>>> 45 Thread 0xf63ffb70 (LWP 29052) 0xf77d0430 in __kernel_vsyscall ()
>>>>> 44 Thread 0xf67d0b70 (LWP 29048) 0xf77d0430 in __kernel_vsyscall ()
>>>>> 43 Thread 0xf67e0710 (LWP 25984) 0xf77d0430 in __kernel_vsyscall ()
>>>>> 42 Thread 0xe6f2fb70 (LWP 30611) 0xf77d0430 in __kernel_vsyscall ()
>>>>>
>>>>> No errors, etc., in the syslog at the time other than errors about
>>>>> the Call Manager service failing.
>>>>>
>>>>> However, the SDL CCM traces show a timer expiring, and it appears the
>>>>> services shut down due to the timer issue.
>>>>>
>>>>> From the following Cisco doc,
>>>>> http://www.cisco.com/c/en/us/support/docs/voice-unified-communications/unified-communications-manager-callmanager/46806-cm-crashes-and-shutdowns.html
>>>>> it looks like the SDL timer thread stopped and the service shut down?
>>>>> This doc is from the CCM 3.x and 4.x time frame. Are the timers
>>>>> mentioned in this doc still true for 10.5, where it shuts down at 16
>>>>> seconds for the SDL timer thread or 20 seconds for the SDL router
>>>>> thread?
>>>>>
>>>>> 09465137.000 |23:55:15.416 |Stopping | | |SdlTCPConnection(1,100,14,37) |SdlTCPConnection(1,100,14,37) | |NumOfCurrentInstances: 39
>>>>>
>>>>> 09465138.000 |23:55:15.448 |SdlSig |SdlVerificationTimer |monitoring |CMProcMon(1,100,234,1) |SdlTimerService(1,100,3,1) |1,100,234,1.3139616^*^* |*TraceFlagOverrode
>>>>>
>>>>> 09465138.001 |23:55:15.448 |AppInfo |CMProcMon::star_sdlVerification - Test Timer exceeded minimum timer latency threshold of 1000 milliseconds, Actual latency: 54360 milliseconds
>>>>>
>>>>> 09465139.000 |23:55:16.135 |AppInfo |CallManagerFailure - Indicates an internal failure in Unified CM Host Name of Hosting Node:CUCM1 Host Node IP Address:192.168.1.5 Reason code:4 Additional Text: App ID:Cisco CallManager Cluster ID:StandAloneCluster Node ID:CUCM1
>>>>>
>>>>> 09465140.000 |23:55:16.135 |AlarmErr |AlarmClass: CallManager, AlarmName: CallManagerFailure, AlarmSeverity: Critical, AlarmMessage: , AlarmDescription: Indicates an internal failure in Unified CM, AlarmParameters: HostName:CUCM1, IPAddress:192.168.1.5, Reason:4, Text:, AppID:Cisco CallManager, ClusterID:StandAloneCluster, NodeID:CUCM1
>>>>>
>>>>> 09465141.000 |23:55:17.465 |AppInfo |CallManagerFailure - Indicates an internal failure in Unified CM Host Name of Hosting Node:CUCM1 Host Node IP Address:192.168.1.5 Reason code:4 Additional Text:CCM Intentional Abort: SignalName: SdlVerificationTimer, DestPID: CMProcMon[1:100:234:1] App ID:Cisco CallManager Cluster ID:StandAloneCluster Node ID:CUCM1
>>>>>
>>>>> 09465142.000 |23:55:17.465 |AlarmErr |AlarmClass: CallManager, AlarmName: CallManagerFailure, AlarmSeverity: Critical, AlarmMessage: , AlarmDescription: Indicates an internal failure in Unified CM, AlarmParameters: HostName:CUCM1, IPAddress:192.168.1.5, Reason:4, Text:CCM Intentional Abort: SignalName: SdlVerificationTimer, DestPID: CMProcMon[1:100:234:1], AppID:Cisco CallManager, ClusterID:StandAloneCluster, NodeID:CUCM1,
>>>>>
>>>>> 09465138.002 |23:55:17.566 |AppInfo |TimerThreadSlowed - Timer thread has slowed beyond acceptable limits App ID:Cisco CallManager Cluster ID:StandAloneCluster Node ID:CUCM1
>>>>>
>>>>> 09465138.003 |23:55:17.566 |AlarmErr |AlarmClass: CallManager, AlarmName: TimerThreadSlowed, AlarmSeverity: Critical, AlarmMessage: , AlarmDescription: Timer thread has slowed beyond acceptable limits, AlarmParameters: AppID:Cisco CallManager, ClusterID:StandAloneCluster, NodeID:CUCM1,
>>>>>
>>>>> 09465143.000 |23:55:17.566 |SdlSig |ReapOldTokenRegistrationsTimer |wait |SIPStationInit(1,100,72,1) |SdlTimerService(1,100,3,1) |1,100,146,1.1^*^* |[R:H-H:36,N:55,L:1,V:1,Z:0,D:0]
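Gaps like the 23:54:59.957 to 23:55:18.402 jump Erick found can be located mechanically in an exported SDL trace. A rough Python sketch (my own illustration, not a Cisco tool; it assumes the second pipe-delimited field is an HH:MM:SS.mmm timestamp, as in the excerpts above, and ignores midnight rollover):

```python
from datetime import datetime

def find_gaps(lines, threshold_s=16.0):
    """Yield (prev_time, time, gap_seconds) for consecutive SDL trace lines
    whose timestamps differ by more than threshold_s (default 16 s, the
    SDL timer-thread limit cited in this thread)."""
    prev = None
    for line in lines:
        fields = line.split("|")
        if len(fields) < 2:
            continue
        try:
            ts = datetime.strptime(fields[1].strip(), "%H:%M:%S.%f")
        except ValueError:
            continue  # not a timestamped trace line
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold_s:
                yield prev.time(), ts.time(), gap
        prev = ts

# The two trace lines bracketing the gap quoted earlier in this thread.
trace = [
    "11574290.000 |23:54:59.957 |SdlStat |Period: 6s #Lines: 19",
    "11574291.000 |23:55:18.402 |AppInfo |SIPSocketProtocol...",
]
for start, end, gap in find_gaps(trace):
    print(start, end, round(gap, 3))
```

On the excerpt above this reports a single gap of about 18.4 seconds, consistent with the 19-second SdlStat period logged right after it.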
_______________________________________________
cisco-voip mailing list
cisco-voip@puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-voip