Re: [cisco-voip] CUCM 10.5 core question - AlterateSyslogAlarm malloc backtrace

2016-01-29 Thread Ryan Ratliff (rratliff)
Until you identify what specifically spiked then you don’t know what to prevent.

Do you have a TAC case open yet?

-Ryan

On Jan 28, 2016, at 7:42 PM, Erick Bergquist wrote:

Thanks Ryan. Nothing really standing out on the ESXi side.

Any suggestions to prevent this in future?

The VMware KB articles suggest a higher timeout on the application side so 
that transient issues and conditions on the VM side don't negatively 
impact the application.


Re: [cisco-voip] CUCM 10.5 core question - AlterateSyslogAlarm malloc backtrace

2016-01-29 Thread Erick Bergquist
I'll unicast you the details Ryan.


Re: [cisco-voip] CUCM 10.5 core question - AlterateSyslogAlarm malloc backtrace

2016-01-22 Thread Ryan Ratliff (rratliff)
How about CPU usage leading up to the cores? 
If you have CCM traces from before, are there gaps in the timestamps between 
line numbers?  

If you are virtualized, were those two nodes on the same host?  If so, did other 
VMs on that host experience any odd issues at the time?
-Ryan
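Checking for those gaps can be scripted. A rough sketch, assuming the `seqnum |HH:MM:SS.mmm |tag |text` trace layout seen in the excerpts in this thread (the function name and 5-second threshold are my own choices, not anything Cisco ships):

```python
# Sketch: flag suspicious gaps between consecutive CCM trace timestamps.
# Assumes "seqnum |HH:MM:SS.mmm |..." lines; ignores midnight rollover.
from datetime import datetime

def find_gaps(lines, threshold_ms=5000):
    gaps, prev = [], None
    for line in lines:
        parts = line.split("|")
        if len(parts) < 2:
            continue
        try:
            ts = datetime.strptime(parts[1].strip(), "%H:%M:%S.%f")
        except ValueError:
            continue  # not a trace line
        if prev is not None:
            delta_ms = round((ts - prev).total_seconds() * 1000)
            if delta_ms > threshold_ms:
                gaps.append((prev.strftime("%H:%M:%S.%f")[:-3],
                             ts.strftime("%H:%M:%S.%f")[:-3], delta_ms))
        prev = ts
    return gaps

# The three SdlStat/AppInfo lines quoted in this thread:
trace = [
    "11574290.000 |23:54:59.957 |SdlStat  |Period: 6s ...",
    "11574291.000 |23:55:18.402 |AppInfo  |SIPSocketProtocol ...",
    "11574292.000 |23:55:18.402 |SdlStat  |Period: 19s ...",
]
print(find_gaps(trace))  # one 18445 ms gap: 23:54:59.957 -> 23:55:18.402
```

Run against the full SDL trace set, this makes the 23:54:59.957 to 23:55:18.402 gap discussed below jump out immediately.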


Re: [cisco-voip] CUCM 10.5 core question - AlterateSyslogAlarm malloc backtrace

2016-01-22 Thread Erick Bergquist
Normal CPU usage up to core, then a CPU spike right after core when
things are starting back up. Different hosts, same storage.

The only gap in the trace file line entries is 23:54:59.957 to 23:55:18.402,
but maybe there was just no activity?

11574290.000 |23:54:59.957 |SdlStat  |Period: 6s #Lines: 19 #Bytes:
3701 Total Number of Buffers: 1 Free LWM: 9989 Free LWM(total):
9192
11574291.000 |23:55:18.402 |AppInfo
|SIPSocketProtocol(2,100,14,137)::handleReadComplete send SdlReadRsp:
size 370
11574292.000 |23:55:18.402 |SdlStat  |Period: 19s #Lines: 1 #Bytes:
117 Total Number of Buffers: 1 Free LWM: 9998 Free LWM(total):
9192
11574293.000 |23:55:18.959 |AppInfo  |CMProcMon - ERROR   SDL
Router Services Dead: timeBeforeSleep: 804509608, currentMontiorTime:
804530608, deltaTime: 21000, LoopCount= 19

The deltaTime is 21000, which would be more than the 20 seconds for the
router thread in the previously mentioned document covering 3.x/4.x.

Do these services still shut down and restart themselves in 10.x when they
exceed 16 seconds (SDL timer) or 20 seconds (router thread), at the same
thresholds that document states?

The VMware KB suggests the SCSI timeout should be increased to handle
transient conditions. I would think these timers should be able to
handle transient issues without failing the service?  Maybe the
thresholds need to be increased to cover these periods.

http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465

I've only seen this with the ccm and ctimanager service. None of the
other services.

Thanks.
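As a quick sanity check, the numbers in that CMProcMon line are self-consistent (a sketch using the logged values; the variable names are mine, not Cisco's):

```python
# Values taken from the "SDL Router Services Dead" trace line above.
time_before_sleep = 804509608     # ms, monitor time before sleeping
current_monitor_time = 804530608  # ms ("currentMontiorTime" in the log)
loop_count = 19                   # apparently completed 1 s monitor passes

delta_time = current_monitor_time - time_before_sleep
print(delta_time)          # 21000 ms, i.e. 21 s of router-thread stall
print(delta_time > 20000)  # True: past the documented 20 s threshold
```

So the monitor saw a 21-second stall, one second past the 20-second router-thread limit, which matches LoopCount reaching 19 one-second passes before the kill.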



Re: [cisco-voip] CUCM 10.5 core question - AlterateSyslogAlarm malloc backtrace

2016-01-22 Thread Erick Bergquist
Ryan,

I'm seeing a small spike in IOWait but not for that long, and no gaps in
the Perfmon logs.  Only the ccm service core dumped, on both call manager
nodes in this cluster at the same time.

Would the system shut down the CCM and CTI Manager services itself if
these timers are too long, like the document from the CCM 3.x/4.x days
alludes to with the 16-second SDL timer thread and 20-second router
thread thresholds?

The subscriber had a different backtrace, Intentional Abort.

 
backtrace - CUCM
===
#0  0xf776c430 in __kernel_vsyscall ()
#1  0xf697b871 in raise () from /lib/libc.so.6
#2  0xf697d14a in abort () from /lib/libc.so.6
#3  0x0837609e in IntentionalAbort () at ProcessCMProcMon.cpp:88
#4  CMProcMon::verifySdlRouterServices () at ProcessCMProcMon.cpp:748
#5  0x083762ca in CMProcMon::callManagerMonitorThread
(cmProcMon=0xefc73038) at ProcessCMProcMon.cpp:429
#6  0xf6c9c5a8 in ACE_OS_Thread_Adapter::invoke (this=0xe6a5d818) at
OS_Thread_Adapter.cpp:103
#7  0xf6c5c5f1 in ace_thread_adapter (args=0xe6a5d818) at
Base_Thread_Adapter.cpp:126
#8  0xf6932b39 in start_thread () from /lib/libpthread.so.0
#9  0xf6a33c1e in clone () from /lib/libc.so.6

info registers


Thanks,
Erick


On Fri, Jan 22, 2016 at 1:23 PM, Ryan Ratliff (rratliff)
 wrote:
> That backtrace is usually accompanied by CTIManager crashing due to 
> intentional abort.
>
>> 09465138.001 |23:55:15.448 |AppInfo  |CMProcMon::star_sdlVerification
>> - Test Timer exceeded minimum timer latency threshold of 1000
>> milliseconds, Actual latency: 54360 milliseconds
>
> That is quite spectacularly bad.  The SDLTimer thread is responsible for 
> _every_ timer the ccm process uses. It has a one-second self-check to make 
> sure it’s getting enough cpu time and in your case it didn’t get to execute 
> it’s one-second check for 54 seconds.
>
> This was definitely performance related so check CPU usage and IOWait at the 
> time (+- a minute or so). Gaps in the RISDC perfmon logs would also indicate 
> performance issues, though leave you with little evidence unless it happened 
> to catch some process at the beginning of a spike.
>
> -Ryan
>
> On Jan 22, 2016, at 1:13 PM, Erick Bergquist  wrote:
>
> Brian,
>
> Looked over those and not seeing any processes consuming lots of
> memory, memory used was 70-75% during this time.
>
>
> On Fri, Jan 22, 2016 at 10:54 AM, Brian Meade  wrote:
>> Did you download Perfmon logs?  Might just be high memory in general causing
>> the issue.  I'd use the Perfmon logs to check what processes were using up
>> all the memory at that time.
>>
>> On Fri, Jan 22, 2016 at 10:45 AM, Erick Bergquist 
>> wrote:
>>>
>>> This happened on CUCM 10.5.2.1-7 and I'm not finding a bug for
>>> this backtrace info in bug toolkit or the 10.5.2 SU1, SU2, SU3 release
>>> notes.
>>>
>>>
>>> From the core analysis,
>>>
>>> Reading symbols from /lib/libnss_files.so.2...(no debugging symbols
>>> found)...done.
>>> Loaded symbols for /lib/libnss_files.so.2
>>> Reading symbols from /lib/libnss_dns.so.2...(no debugging symbols
>>> found)...done.
>>> Loaded symbols for /lib/libnss_dns.so.2
>>> Core was generated by `/usr/local/cm/bin/ccm'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0xf6a28b55 in _int_malloc () from /lib/libc.so.6
>>>
>>>  
>>> backtrace - CUCM
>>> ===
>>> #0  0xf6a28b55 in _int_malloc () from /lib/libc.so.6
>>> #1  0xf6a29d9e in malloc () from /lib/libc.so.6
>>> #2  0x09be122c in AlternateSyslogAlarmMonitor::manageLogs
>>> (this=0xd7d74f8, severity=6, strBuf=) at
>>> AlternateSyslogAlarmMonitor.cpp:162
>>> #3  0x09be13f6 in AlternateSyslogAlarmMonitor::AlarmThreadProc
>>> (pParameter=0xd7d74f8) at AlternateSyslogAlarmMonitor.cpp:61
>>> #4  0xf6d005a8 in ACE_OS_Thread_Adapter::invoke (this=0xd7d8dc8) at
>>> OS_Thread_Adapter.cpp:103
>>> #5  0xf6cc05f1 in ace_thread_adapter (args=0xd7d8dc8) at
>>> Base_Thread_Adapter.cpp:126
>>> #6  0xf6996b39 in start_thread () from /lib/libpthread.so.0
>>> #7  0xf6a97c1e in clone () from /lib/libc.so.6
>>> 
>>> info registers
>>> 
>>> eax0x198408
>>> ecx0xf6b483d0-155941936
>>> edx0xf6b483a0-155941984
>>> ebx0xf6b46ff4-155947020
>>> esp0xf5d0ab100xf5d0ab10
>>> ebp0xf5d0ab980xf5d0ab98
>>> esi0xf4bc0b8256622776
>>> edi0x00
>>> eip0xf6a28b550xf6a28b55 <_int_malloc+725>
>>> eflags 0x10212[ AF IF RF ]
>>> cs 0x2335
>>> ss 0x2b43
>>> ds 0x2b43
>>> es 0x2b43
>>> fs 0x00
>>> gs 0x6399
>>> 
>>> info threads

Re: [cisco-voip] CUCM 10.5 core question - AlterateSyslogAlarm malloc backtrace

2016-01-22 Thread Erick Bergquist
Brian,

Looked over those and I'm not seeing any processes consuming lots of
memory; memory use was 70-75% during this time.


On Fri, Jan 22, 2016 at 10:54 AM, Brian Meade  wrote:
> Did you download Perfmon logs?  Might just be high memory in general causing
> the issue.  I'd use the Perfmon logs to check what processes were using up
> all the memory at that time.
>
> On Fri, Jan 22, 2016 at 10:45 AM, Erick Bergquist wrote:
>>
>> This happened on CUCM 10.5.2.1-7 and I'm not finding a bug for
>> this backtrace info in bug toolkit or the 10.5.2 SU1, SU2, SU3 release
>> notes.
>>
>>
>> From the core analysis,
>>
>> Reading symbols from /lib/libnss_files.so.2...(no debugging symbols
>> found)...done.
>> Loaded symbols for /lib/libnss_files.so.2
>> Reading symbols from /lib/libnss_dns.so.2...(no debugging symbols
>> found)...done.
>> Loaded symbols for /lib/libnss_dns.so.2
>> Core was generated by `/usr/local/cm/bin/ccm'.
>> Program terminated with signal 11, Segmentation fault.
>> #0  0xf6a28b55 in _int_malloc () from /lib/libc.so.6
>>
>> backtrace - CUCM
>> ===
>> #0  0xf6a28b55 in _int_malloc () from /lib/libc.so.6
>> #1  0xf6a29d9e in malloc () from /lib/libc.so.6
>> #2  0x09be122c in AlternateSyslogAlarmMonitor::manageLogs
>> (this=0xd7d74f8, severity=6, strBuf=) at
>> AlternateSyslogAlarmMonitor.cpp:162
>> #3  0x09be13f6 in AlternateSyslogAlarmMonitor::AlarmThreadProc
>> (pParameter=0xd7d74f8) at AlternateSyslogAlarmMonitor.cpp:61
>> #4  0xf6d005a8 in ACE_OS_Thread_Adapter::invoke (this=0xd7d8dc8) at
>> OS_Thread_Adapter.cpp:103
>> #5  0xf6cc05f1 in ace_thread_adapter (args=0xd7d8dc8) at
>> Base_Thread_Adapter.cpp:126
>> #6  0xf6996b39 in start_thread () from /lib/libpthread.so.0
>> #7  0xf6a97c1e in clone () from /lib/libc.so.6
>>
>> info registers
>>
>> eax            0x198        408
>> ecx            0xf6b483d0   -155941936
>> edx            0xf6b483a0   -155941984
>> ebx            0xf6b46ff4   -155947020
>> esp            0xf5d0ab10   0xf5d0ab10
>> ebp            0xf5d0ab98   0xf5d0ab98
>> esi            0xf4bc0b8    256622776
>> edi            0x0          0
>> eip            0xf6a28b55   0xf6a28b55 <_int_malloc+725>
>> eflags         0x10212      [ AF IF RF ]
>> cs             0x23         35
>> ss             0x2b         43
>> ds             0x2b         43
>> es             0x2b         43
>> fs             0x0          0
>> gs             0x63         99
>>  
>>
>> info threads
>>
>>   47 Thread 0xe750bb70 (LWP 30608)  0xf6a237f4 in _IO_flush_all_lockp () from /lib/libc.so.6
>>   46 Thread 0xf5effb70 (LWP 29077)  0xf77d0430 in __kernel_vsyscall ()
>>   45 Thread 0xf63ffb70 (LWP 29052)  0xf77d0430 in __kernel_vsyscall ()
>>   44 Thread 0xf67d0b70 (LWP 29048)  0xf77d0430 in __kernel_vsyscall ()
>>   43 Thread 0xf67e0710 (LWP 25984)  0xf77d0430 in __kernel_vsyscall ()
>>   42 Thread 0xe6f2fb70 (LWP 30611)  0xf77d0430 in __kernel_vsyscall ()
>>
>> No errors or anything else in the Syslog at the time other than errors
>> about the call manager service failing.
>>
>> However, the SDL CCM traces show a timer expiring, and it appears
>> services are shut down due to the timer issue.
>>
>> From the following Cisco doc,
>>
>> http://www.cisco.com/c/en/us/support/docs/voice-unified-communications/unified-communications-manager-callmanager/46806-cm-crashes-and-shutdowns.html
>> it looks like the SDL timer thread stopped and the service shut down?
>> This doc is from the CCM 3.x and 4.x time frame.  Are the timers mentioned
>> in this doc still true for 10.5 - where it shuts down at 16 seconds
>> for the SDL timer thread or 20 seconds for the SDL router thread?
>>
>>
>> 09465137.000 |23:55:15.416 |Stopping | | |SdlTCPConnection(1,100,14,37) |SdlTCPConnection(1,100,14,37)| |NumOfCurrentInstances: 39
>>
>> 09465138.000 |23:55:15.448 |SdlSig   |SdlVerificationTimer |monitoring |CMProcMon(1,100,234,1) |SdlTimerService(1,100,3,1) |1,100,234,1.3139616^*^* |*TraceFlagOverrode
>>
>> 09465138.001 |23:55:15.448 |AppInfo  |CMProcMon::star_sdlVerification
>> - Test Timer exceeded minimum timer latency threshold of 1000
>> milliseconds, Actual latency: 54360 milliseconds
>>
>> 09465139.000 |23:55:16.135 |AppInfo  |CallManagerFailure - Indicates
>> an internal failure in Unified CM Host Name of Hosting Node:CUCM1 Host
>> Node IP Address:192.168.1.5 Reason code:4 Additional Text: App
>> ID:Cisco CallManager Cluster ID:StandAloneCluster Node ID:CUCM1
>>
>> 09465140.000 |23:55:16.135 |AlarmErr |AlarmClass: CallManager,
>> AlarmName: CallManagerFailure, AlarmSeverity: Critical, AlarmMessage:
>> , AlarmDescription: Indicates an internal failure in Unified CM,
>> AlarmParameters:  HostName:CUCM1, IPAddress:192.168.1.5, Reason:4,
>> Text:, AppID:Cisco CallManager, 

Re: [cisco-voip] CUCM 10.5 core question - AlterateSyslogAlarm malloc backtrace

2016-01-22 Thread Ryan Ratliff (rratliff)
That backtrace is usually accompanied by CTIManager crashing due to intentional 
abort. 

> 09465138.001 |23:55:15.448 |AppInfo  |CMProcMon::star_sdlVerification
> - Test Timer exceeded minimum timer latency threshold of 1000
> milliseconds, Actual latency: 54360 milliseconds

That is quite spectacularly bad.  The SDLTimer thread is responsible for 
_every_ timer the ccm process uses. It has a one-second self-check to make sure 
it's getting enough CPU time, and in your case it didn't get to execute its 
one-second check for 54 seconds. 

This was definitely performance related, so check CPU usage and IOWait at the 
time (+/- a minute or so). Gaps in the RISDC perfmon logs would also indicate 
performance issues, though they leave you with little evidence unless they happened to 
catch some process at the beginning of a spike.

-Ryan
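The self-check described above can be sketched as a watchdog that measures how late its periodic wakeup actually fired. This is a hypothetical illustration of the mechanism, not the CMProcMon source: per the backtrace, the real ccm process calls IntentionalAbort() in the starved case so a core gets written, whereas this sketch just reports it.

```python
import time

class TimerWatchdog:
    """Sketch of a 1 s timer-latency self-check (not Cisco's code)."""
    def __init__(self, interval_ms=1000, threshold_ms=1000, clock=time.monotonic):
        self.interval_ms = interval_ms    # how often tick() should fire
        self.threshold_ms = threshold_ms  # tolerated lateness before "starved"
        self.clock = clock
        self.last = clock()

    def tick(self):
        """Call once per intended interval; returns (starved, latency_ms)."""
        now = self.clock()
        latency_ms = round((now - self.last) * 1000) - self.interval_ms
        self.last = now
        return latency_ms > self.threshold_ms, latency_ms

# Fake clock simulating a thread that meant to tick after 1 s but was
# starved for 55.36 s -- reproducing the 54360 ms latency in the trace.
fake_now = [0.0]
wd = TimerWatchdog(clock=lambda: fake_now[0])
fake_now[0] = 55.36
print(wd.tick())  # (True, 54360)
```

A real deployment would run tick() from the monitored thread itself, so the measured lateness reflects scheduler/IO starvation of that thread rather than wall-clock drift.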


Re: [cisco-voip] CUCM 10.5 core question - AlterateSyslogAlarm malloc backtrace

2016-01-22 Thread Ryan Ratliff (rratliff)
That's your culprit right there. That 16 second sdlrouter thread timer is one 
of the few things I bet hadn't changed since 4.x. 
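
The watchdog arithmetic behind that timer can be sketched like this (a minimal illustration, not CUCM source; the field names mirror the CMProcMon trace output quoted below, and the 20 s threshold is the router-thread figure from the 3.x/4.x doc):

```python
# Hypothetical sketch of a CMProcMon-style router-thread watchdog.
# Field names (timeBeforeSleep, deltaTime) mirror the SDL trace output;
# the 20000 ms threshold is the router-thread figure from the 3.x/4.x doc.
ROUTER_DEAD_THRESHOLD_MS = 20000

def check_router_alive(time_before_sleep_ms, current_monitor_time_ms):
    """Return (alive, delta_ms) the way the trace reports it."""
    delta_ms = current_monitor_time_ms - time_before_sleep_ms
    return delta_ms <= ROUTER_DEAD_THRESHOLD_MS, delta_ms

# Values from the trace below: a delta of 21000 ms exceeds the 20 s
# threshold, so CMProcMon logs "SDL Router Services Dead" and the
# service is aborted.
alive, delta_ms = check_router_alive(804509608, 804530608)
```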

Can you look at esxi stats for that time period to see what the host thought 
was going on?

-Ryan

> On Jan 22, 2016, at 4:45 PM, Erick Bergquist  wrote:
> 
> Normal CPU usage up to core, then a CPU spike right after core when
> things are starting back up. Different hosts, same storage.
> 
> The only gap in the trace file line entries is 23:54:59.957 to
> 23:55:18.402, but maybe there was simply no activity?
> 
> 11574290.000 |23:54:59.957 |SdlStat  |Period: 6s #Lines: 19 #Bytes:
> 3701 Total Number of Buffers: 1 Free LWM: 9989 Free LWM(total):
> 9192
> 11574291.000 |23:55:18.402 |AppInfo
> |SIPSocketProtocol(2,100,14,137)::handleReadComplete send SdlReadRsp:
> size 370
> 11574292.000 |23:55:18.402 |SdlStat  |Period: 19s #Lines: 1 #Bytes:
> 117 Total Number of Buffers: 1 Free LWM: 9998 Free LWM(total):
> 9192
> 11574293.000 |23:55:18.959 |AppInfo  |CMProcMon - ERROR   SDL
> Router Services Dead: timeBeforeSleep: 804509608, currentMontiorTime:
> 804530608, deltaTime: 21000, LoopCount= 19
> 
> The deltaTime is 21000, which is more than the 20 seconds allowed for the
> router thread in the previously mentioned document covering 3.x/4.x.
> 
> Do these services still shut down and restart themselves in 10.x when over
> 16 seconds (SDL timer thread) or 20 seconds (router thread), at the same
> thresholds that document states?
> 
> The VMware KB suggests the SCSI timeout should be increased to handle
> transient conditions. I would think these timers should be able to
> handle transient issues without failing the service?  Maybe the
> thresholds need to be increased to handle these periods.
> 
> http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465
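
For reference, on a generic RHEL-style Linux guest that KB raises the per-device SCSI command timeout with a udev rule along these lines (illustrative fragment only; CUCM is a closed appliance, so any change there would be for Cisco TAC to make, and newer VMware Tools releases typically install an equivalent rule automatically):

```shell
# /etc/udev/rules.d/99-vmware-scsi-udev.rules  (illustrative fragment,
# in the spirit of VMware KB 1009465; exact match strings can vary by
# guest OS release)
# Raise the SCSI command timeout to 180 s for VMware virtual disks so
# transient storage stalls surface as slow I/O rather than device errors.
ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="VMware  ", \
  ATTRS{model}=="Virtual disk", \
  RUN+="/bin/sh -c 'echo 180 >/sys$DEVPATH/device/timeout'"
```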
> 
> I've only seen this with the ccm and ctimanager service. None of the
> other services.
> 
> Thanks.
> 
> 
> On Fri, Jan 22, 2016 at 2:44 PM, Ryan Ratliff (rratliff)
>  wrote:
>> How about CPU usage leading up to the cores?
>> If you have CCM traces from before are there gaps in the timestamps between 
>> line numbers?
>> 
>> If you are virtualized were those two nodes on the same host?  If so did 
>> other VMs on that host experience any odd issues at the time?
>> -Ryan
>> 
>> On Jan 22, 2016, at 3:29 PM, Erick Bergquist  wrote:
>> 
>> Ryan,
>> 
>> I'm seeing a small spike in IOWait, but not for that long, and no gaps in
>> the Perfmon logs.  Only the ccm service core dumped, on both call
>> manager nodes in this cluster at the same time.
>> 
>> Would the system shut down the CCM and CTI Manager services itself if
>> these timers run too long, as that document from the CCM 3.x/4.x days
>> alludes to with the 16-second SDL timer thread and 20-second router
>> thread thresholds?
>> 
>> 
>> The subscriber had a different backtrace, an Intentional Abort:
>> 
>> 
>> backtrace - CUCM
>> ===
>> #0  0xf776c430 in __kernel_vsyscall ()
>> #1  0xf697b871 in raise () from /lib/libc.so.6
>> #2  0xf697d14a in abort () from /lib/libc.so.6
>> #3  0x0837609e in IntentionalAbort () at ProcessCMProcMon.cpp:88
>> #4  CMProcMon::verifySdlRouterServices () at ProcessCMProcMon.cpp:748
>> #5  0x083762ca in CMProcMon::callManagerMonitorThread
>> (cmProcMon=0xefc73038) at ProcessCMProcMon.cpp:429
>> #6  0xf6c9c5a8 in ACE_OS_Thread_Adapter::invoke (this=0xe6a5d818) at
>> OS_Thread_Adapter.cpp:103
>> #7  0xf6c5c5f1 in ace_thread_adapter (args=0xe6a5d818) at
>> Base_Thread_Adapter.cpp:126
>> #8  0xf6932b39 in start_thread () from /lib/libpthread.so.0
>> #9  0xf6a33c1e in clone () from /lib/libc.so.6
>> 
>> info registers
>> 
>> 
>> 
>> Thanks,
>> Erick
>> 
>> 
>> On Fri, Jan 22, 2016 at 1:23 PM, Ryan Ratliff (rratliff)
>>  wrote:
>>> That backtrace is usually accompanied by CTIManager crashing due to 
>>> intentional abort.
>>> 
 09465138.001 |23:55:15.448 |AppInfo  |CMProcMon::star_sdlVerification
 - Test Timer exceeded minimum timer latency threshold of 1000
 milliseconds, Actual latency: 54360 milliseconds
>>> 
>>> That is quite spectacularly bad.  The SDLTimer thread is responsible for 
>>> _every_ timer the ccm process uses. It has a one-second self-check to make 
>>> sure it’s getting enough CPU time, and in your case it didn’t get to execute 
>>> its one-second check for 54 seconds.
>>> 
>>> This was definitely performance related so check CPU usage and IOWait at 
>>> the time (+- a minute or so). Gaps in the RISDC perfmon logs would also 
>>> indicate performance issues, though leave you with little evidence unless 
>>> it happened to catch some process at the beginning of a spike.
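
That one-second self-check can be sketched roughly as follows (hypothetical names; the real logic lives in ProcessCMProcMon.cpp, which is not public, and the trace appears to report the total elapsed time as "Actual latency"):

```python
# Rough sketch of the SDLTimer thread's one-second self-check.
# A 1 s test timer is armed; if it fires far later than scheduled, the
# thread was starved of CPU and the process is intentionally aborted.
LATENCY_THRESHOLD_MS = 1000   # "minimum timer latency threshold" in the trace

def check_test_timer(armed_at_ms, fired_at_ms):
    """Return (actual_latency_ms, should_abort)."""
    actual_latency_ms = fired_at_ms - armed_at_ms
    return actual_latency_ms, actual_latency_ms > LATENCY_THRESHOLD_MS

# With the trace's numbers the timer fired 54360 ms after it was armed,
# so the check trips and CMProcMon intentionally aborts the ccm process.
latency_ms, should_abort = check_test_timer(0, 54360)
```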
>>> 
>>> -Ryan
>>> 
>>> On Jan 22, 2016, at 1:13 PM, Erick Bergquist  wrote:
>>> 
>>> Brian,
>>> 
>>> Looked over those and 

Re: [cisco-voip] CUCM 10.5 core question - AlterateSyslogAlarm malloc backtrace

2016-01-22 Thread Brian Meade
Did you download Perfmon logs?  Might just be high memory in general
causing the issue.  I'd use the Perfmon logs to check what processes were
using up all the memory at that time.

On Fri, Jan 22, 2016 at 10:45 AM, Erick Bergquist 
wrote:

> This happened on CUCM 10.5.2.1-7 and I'm not finding a bug for
> this backtrace info in bug toolkit or the 10.5.2 SU1, SU2, SU3 release
> notes.

[cisco-voip] CUCM 10.5 core question - AlterateSyslogAlarm malloc backtrace

2016-01-22 Thread Erick Bergquist
This happened on CUCM 10.5.2.1-7 and I'm not finding a bug for
this backtrace info in bug toolkit or the 10.5.2 SU1, SU2, SU3 release
notes.


From the core analysis,

Reading symbols from /lib/libnss_files.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/libnss_files.so.2
Reading symbols from /lib/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_dns.so.2
Core was generated by `/usr/local/cm/bin/ccm'.
Program terminated with signal 11, Segmentation fault.
#0  0xf6a28b55 in _int_malloc () from /lib/libc.so.6

  
 backtrace - CUCM
 ===
 #0  0xf6a28b55 in _int_malloc () from /lib/libc.so.6
#1  0xf6a29d9e in malloc () from /lib/libc.so.6
#2  0x09be122c in AlternateSyslogAlarmMonitor::manageLogs
(this=0xd7d74f8, severity=6, strBuf=) at
AlternateSyslogAlarmMonitor.cpp:162
#3  0x09be13f6 in AlternateSyslogAlarmMonitor::AlarmThreadProc
(pParameter=0xd7d74f8) at AlternateSyslogAlarmMonitor.cpp:61
#4  0xf6d005a8 in ACE_OS_Thread_Adapter::invoke (this=0xd7d8dc8) at
OS_Thread_Adapter.cpp:103
#5  0xf6cc05f1 in ace_thread_adapter (args=0xd7d8dc8) at
Base_Thread_Adapter.cpp:126
#6  0xf6996b39 in start_thread () from /lib/libpthread.so.0
#7  0xf6a97c1e in clone () from /lib/libc.so.6
 
 info registers
 
eax            0x198408      1672200
ecx            0xf6b483d0    -155941936
edx            0xf6b483a0    -155941984
ebx            0xf6b46ff4    -155947020
esp            0xf5d0ab10    0xf5d0ab10
ebp            0xf5d0ab98    0xf5d0ab98
esi            0xf4bc0b8     256622776
edi            0x0           0
eip            0xf6a28b55    0xf6a28b55 <_int_malloc+725>
eflags         0x10212       [ AF IF RF ]
cs             0x23          35
ss             0x2b          43
ds             0x2b          43
es             0x2b          43
fs             0x0           0
gs             0x63          99
 
 info threads
 
   47 Thread 0xe750bb70 (LWP 30608)  0xf6a237f4 in _IO_flush_all_lockp
() from /lib/libc.so.6
  46 Thread 0xf5effb70 (LWP 29077)  0xf77d0430 in __kernel_vsyscall ()
  45 Thread 0xf63ffb70 (LWP 29052)  0xf77d0430 in __kernel_vsyscall ()
  44 Thread 0xf67d0b70 (LWP 29048)  0xf77d0430 in __kernel_vsyscall ()
  43 Thread 0xf67e0710 (LWP 25984)  0xf77d0430 in __kernel_vsyscall ()
  42 Thread 0xe6f2fb70 (LWP 30611)  0xf77d0430 in __kernel_vsyscall ()


No errors in the syslog at the time, other than errors about the Call
Manager service failing.

However, the SDL CCM traces show a timer expiring, and it appears the
services were shut down due to the timer issue.

From the following Cisco doc,
http://www.cisco.com/c/en/us/support/docs/voice-unified-communications/unified-communications-manager-callmanager/46806-cm-crashes-and-shutdowns.html
it looks like the SDL timer thread stopped and the service shut down.
This doc is from the CCM 3.x and 4.x time frame.  Are the timers mentioned
in it still true for 10.5, where the service shuts down at 16 seconds for
the SDL timer thread or 20 seconds for the SDL router thread?


09465137.000 |23:55:15.416 |Stopping |
  |   |SdlTCPConnection(1,100,14,37)
 |SdlTCPConnection(1,100,14,37)|
  |NumOfCurrentInstances: 39

09465138.000 |23:55:15.448 |SdlSig   |SdlVerificationTimer
  |monitoring |CMProcMon(1,100,234,1)
 |SdlTimerService(1,100,3,1)   |1,100,234,1.3139616^*^*
  |*TraceFlagOverrode

09465138.001 |23:55:15.448 |AppInfo  |CMProcMon::star_sdlVerification
- Test Timer exceeded minimum timer latency threshold of 1000
milliseconds, Actual latency: 54360 milliseconds

09465139.000 |23:55:16.135 |AppInfo  |CallManagerFailure - Indicates
an internal failure in Unified CM Host Name of Hosting Node:CUCM1 Host
Node IP Address:192.168.1.5 Reason code:4 Additional Text: App
ID:Cisco CallManager Cluster ID:StandAloneCluster Node ID:CUCM1

09465140.000 |23:55:16.135 |AlarmErr |AlarmClass: CallManager,
AlarmName: CallManagerFailure, AlarmSeverity: Critical, AlarmMessage:
, AlarmDescription: Indicates an internal failure in Unified CM,
AlarmParameters:  HostName:CUCM1, IPAddress:192.168.1.5, Reason:4,
Text:, AppID:Cisco CallManager, ClusterID:StandAloneCluster,
NodeID:CUCM1

09465141.000 |23:55:17.465 |AppInfo  |CallManagerFailure - Indicates
an internal failure in Unified CM Host Name of Hosting Node:CUCM1 Host
Node IP Address:192.168.1.5 Reason code:4 Additional Text:CCM
Intentional Abort: SignalName: SdlVerificationTimer, DestPID:
CMProcMon[1:100:234:1] App ID:Cisco CallManager Cluster
ID:StandAloneCluster Node ID:CUCM1

09465142.000 |23:55:17.465 |AlarmErr |AlarmClass: CallManager,
AlarmName: CallManagerFailure, AlarmSeverity: Critical, AlarmMessage:
, AlarmDescription: Indicates an internal failure in Unified CM,
AlarmParameters:  HostName:CUCM1, IPAddress:192.168.1.5, Reason:4,
Text:CCM Intentional Abort: SignalName: SdlVerificationTimer, DestPID:
CMProcMon[1:100:234:1], AppID:Cisco CallManager,