I suggest we close this defect ticket as "not reproducible".
That is, unless someone can reproduce it.
My best guess is that this is yet another side effect of testing an overloaded 
system.

Since we have no load regulation mechanism and no overload protection mechanism,
it is relatively easy to push the system until it starts to break down. This is
what is happening here.
The IMMND misses a timely response on a healthcheck heartbeat.
Such a heartbeat timeout is, I expect (hope), set to 2 or 3 minutes.
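
For reference, the period and the maximum response time of such a healthcheck
are IMM configuration attributes on the component's healthcheck object. Below is
a minimal sketch of reading them through the IMM OM accessor API; the DN, the
healthcheck key and the attribute names (saAmfHealthcheckPeriod,
saAmfHealthcheckMaxDuration) are assumptions for illustration, not values
checked on this system:

```c
/* Minimal sketch: read a component healthcheck's configured period and
 * maximum duration via the IMM OM accessor API.  The DN and attribute
 * names are assumptions for illustration. */
#include <stdio.h>
#include <string.h>
#include <saImmOm.h>

int main(void)
{
    SaVersionT version = { 'A', 2, 11 };
    SaImmHandleT om_handle;
    SaImmAccessorHandleT acc_handle;
    SaNameT object_name;
    /* Hypothetical DN of the IMMND healthcheck object. */
    const char *dn =
        "safHealthcheckKey=osafimmnd,safComp=IMMND,safSu=SC-1,safSg=NoRed,safApp=OpenSAF";
    SaImmAttrNameT attr_names[] = {
        (char *)"saAmfHealthcheckPeriod",      /* assumed attribute name */
        (char *)"saAmfHealthcheckMaxDuration", /* assumed attribute name */
        NULL
    };
    SaImmAttrValuesT_2 **attributes = NULL;

    if (saImmOmInitialize(&om_handle, NULL, &version) != SA_AIS_OK)
        return 1;
    if (saImmOmAccessorInitialize(om_handle, &acc_handle) != SA_AIS_OK)
        return 1;

    object_name.length = (SaUint16T)strlen(dn);
    memcpy(object_name.value, dn, object_name.length);

    if (saImmOmAccessorGet_2(acc_handle, &object_name, attr_names,
                             &attributes) == SA_AIS_OK) {
        for (int i = 0; attributes[i] != NULL; i++) {
            if (attributes[i]->attrValuesNumber > 0 &&
                attributes[i]->attrValueType == SA_IMM_ATTR_SATIMET) {
                SaTimeT ns = *(SaTimeT *)attributes[i]->attrValues[0];
                printf("%s = %lld ns\n",
                       attributes[i]->attrName, (long long)ns);
            }
        }
    }

    saImmOmAccessorFinalize(acc_handle);
    saImmOmFinalize(om_handle);
    return 0;
}
```

The same attributes can also be listed with the immlist tool on a running node.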

The *only* reason for the healthcheck's existence is to detect and restart a
hung/looping process.
If a process is hung or looping, it will be so indefinitely. So we don't want
false positives shooting down the service just because the system is
temporarily overloaded. This just adds more load.
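
To make the mechanism concrete, here is a minimal sketch of the AMF-invoked
healthcheck pattern, assuming a simple single-threaded poll/dispatch loop; the
handle names and the healthcheck key are illustrative, not taken from the IMMND
code:

```c
/* Minimal sketch of the AMF-invoked healthcheck pattern.  AMF invokes
 * the callback periodically; the component must answer with
 * saAmfResponse() before the configured maximum duration expires,
 * otherwise AMF records a healthcheck callback timeout and applies the
 * recommended recovery. */
#include <poll.h>
#include <string.h>
#include <saAmf.h>

static SaAmfHandleT amf_handle;

static void healthcheck_cb(SaInvocationT invocation, const SaNameT *comp_name,
                           SaAmfHealthcheckKeyT *key)
{
    (void)comp_name;
    (void)key;
    /* A healthy process answers immediately.  A hung or looping process
     * never gets here (or never returns to the poll loop), so no
     * response is sent and the callback times out in AMF. */
    saAmfResponse(amf_handle, invocation, SA_AIS_OK);
}

int main(void)
{
    SaVersionT version = { 'B', 1, 1 };
    SaAmfCallbacksT callbacks;
    SaSelectionObjectT sel_obj;
    SaNameT comp_name;
    SaAmfHealthcheckKeyT key;

    memset(&callbacks, 0, sizeof(callbacks));
    callbacks.saAmfHealthcheckCallback = healthcheck_cb;

    if (saAmfInitialize(&amf_handle, &callbacks, &version) != SA_AIS_OK)
        return 1;
    saAmfComponentNameGet(amf_handle, &comp_name);
    saAmfComponentRegister(amf_handle, &comp_name, NULL);
    saAmfSelectionObjectGet(amf_handle, &sel_obj);

    /* Hypothetical healthcheck key; the real key comes from the IMM model. */
    strcpy((char *)key.key, "osafimmnd");
    key.keyLen = (SaUint16T)strlen("osafimmnd");

    saAmfHealthcheckStart(amf_handle, &comp_name, &key,
                          SA_AMF_HEALTHCHECK_AMF_INVOKED,
                          SA_AMF_COMPONENT_FAILOVER);

    for (;;) {
        struct pollfd fds = { .fd = (int)sel_obj, .events = POLLIN };
        poll(&fds, 1, -1);
        /* If this loop is stuck elsewhere (overload, a blocking call that
         * does not return), the pending callback is never dispatched and
         * the healthcheck response deadline expires. */
        saAmfDispatch(amf_handle, SA_DISPATCH_ALL);
    }
}
```

The point relevant to this ticket is that the response comes from the
component's own main thread, so a temporarily overloaded but otherwise healthy
process can trip the same timeout as a genuinely hung one.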

If a process has not responded in a few minutes, then we can really assume it
is hung.
The assumption here is also that a genuinely hung process is a very rare kind
of problem.
This assumption is correct relative to the IMMND, because the IMMND only blocks
on calls to MDS and (ironically) on some synchronous AMF calls.
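
For illustration, the kind of synchronous library call meant here looks like the
sketch below (a generic example, not IMMND code); it does not return until the
AMF node director replies, so while it is outstanding the same thread cannot
dispatch and answer a pending healthcheck callback:

```c
/* Illustration only (not IMMND code): a synchronous AMF library call
 * made from the same thread that runs the poll/dispatch loop.  Until
 * the AMF node director replies, no pending healthcheck callback can
 * be dispatched or answered from this thread. */
#include <saAmf.h>

SaAisErrorT report_peer_fault(SaAmfHandleT amf_handle, const SaNameT *peer)
{
    /* Blocks for the round trip to AMF (plus any overload-induced delay). */
    return saAmfComponentErrorReport(amf_handle, peer,
                                     0 /* errorDetectionTime */,
                                     SA_AMF_COMPONENT_RESTART,
                                     0 /* ntfIdentifier */);
}
```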


---

** [tickets:#1291] IMM: IMMD healthcheck callback timeout when standby 
controller rebooted in middle of IMMND sync**

**Status:** assigned
**Milestone:** 4.5.2
**Created:** Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
**Last Updated:** Fri Apr 10, 2015 01:09 PM UTC
**Owner:** Neelakanta Reddy
**Attachments:**

- [immlogs.tar.bz2](https://sourceforge.net/p/opensaf/tickets/1291/attachment/immlogs.tar.bz2) (6.8 MB; application/x-bzip)


The issue is observed with 4.6 FC changeset 6377. The system is up and running
with a single PBE and 50k objects. This issue is seen after
http://sourceforge.net/p/opensaf/tickets/1290 is observed. An IMM application is
running on the standby controller and an immcfg command is run from a payload to
set the CompRestartMax value to 1000. IMMND is killed twice on the standby
controller, leading to #1290.

As a result, the standby controller left the cluster in the middle of the sync,
IMMD reported a healthcheck callback timeout, and the active controller also
went for a reboot. Following is the syslog of SC-1:

Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for node id 
2020f:
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF NodeId = 
131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 
131343, SupervisionTime = 60
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC: Resetting link 
<1.1.1:eth0-1.1.2:eth0>, peer not responding
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost link 
<1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost contact with 
<1.1.2>
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not 
sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not 
sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not 
sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not 
sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not 
sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not 
sending track callback for agents on that node
Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left the cluster
Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node in the 
absence of PLM is outside the scope of OpenSAF
Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC: Established link 
<1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics; 
dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', 
processed='center(queued)=2197', processed='center(received)=1172', 
processed='destination(messages)=1172', processed='destination(mailinfo)=0', 
processed='destination(mailwarn)=0', 
processed='destination(localmessages)=955', processed='destination(newserr)=0', 
processed='destination(mailerr)=0', processed='destination(netmgm)=0', 
processed='destination(warn)=44', processed='destination(console)=13', 
processed='destination(null)=0', processed='destination(mail)=0', 
processed='destination(xconsole)=13', processed='destination(firewall)=0', 
processed='destination(acpid)=0', processed='destination(newscrit)=0', 
processed='destination(newsnotice)=0', processed='source(src)=1172'
Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on 
saImmOmSearchNext - aborting
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY FAILED 
status:1
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE: 
IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE-> 
IMM_NODE_FULLY_AVAILABLE (2484)
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12 in ImmModel
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord broadcasting 
ABORT_SYNC, epoch:12
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12 committing 
with ccbId:100000054/4294967380
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover probation 
timer started (timeout: 1200000000000 ns)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing failover of 
'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated 
from 'componentFailover' to 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER 
safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due 
to:healthCheckcallbackTimeout Recovery is:suFailover
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF NodeId = 
131343 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131343, SupervisionTime = 60
Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; 
timeout=60

syslog, immnd and immd traces of SC-1 attached.


