Re: [tickets] [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

Sirisha Alla Wed, 19 Aug 2015 02:07:09 -0700

Yes, I tried this today. The healthcheck timeout happened on IMMD, not on IMMND.


/Sirisha


On Wednesday 19 August 2015 02:28 PM, Anders Bjornerstedt wrote:

Changeset "6744" was generated today.
So I assume this means you reproduced this today.
The IMMND main poll handling processes each descriptor in sequence, so it should not be possible
for traffic on one descriptor to "starve out" a job on another.

/AndersBj
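
As a rough illustration of the point about sequential descriptor handling, here is a minimal Python sketch. It is not OpenSAF code: the function name `service_in_sequence`, the `max_passes` limit, and the use of pipes as stand-ins for IMMND descriptors are all assumptions made for the example. Each pass of the loop visits every ready descriptor once and does one unit of work on it, so a descriptor with a backlog cannot keep a quieter one from being serviced:

```python
import os
import selectors


def service_in_sequence(read_fds, max_passes=10):
    """Poll the descriptors and, on each pass, do one unit of work per ready one.

    Returns the order in which descriptors were serviced.
    """
    sel = selectors.DefaultSelector()
    for fd in read_fds:
        sel.register(fd, selectors.EVENT_READ)

    serviced = []
    try:
        for _ in range(max_passes):
            ready = sel.select(timeout=0)  # non-blocking check of all descriptors
            if not ready:
                break
            # Visit every ready descriptor once, in a fixed order, before polling again.
            for key, _events in sorted(ready, key=lambda item: item[0].fd):
                os.read(key.fd, 1)  # one byte stands in for one queued job
                serviced.append(key.fd)
    finally:
        sel.close()
    return serviced


if __name__ == "__main__":
    busy_r, busy_w = os.pipe()
    quiet_r, quiet_w = os.pipe()
    os.write(busy_w, b"xxxxx")  # backlog of five jobs (hypothetical traffic)
    os.write(quiet_w, b"y")     # a single job, e.g. a healthcheck callback
    print(service_in_sequence([busy_r, quiet_r]))
    for fd in (busy_r, busy_w, quiet_r, quiet_w):
        os.close(fd)
```

With five bytes queued on one pipe and one byte on the other, the quiet pipe is serviced during the first pass rather than after the backlog is drained. Whether the real IMMND loop bounds the work per descriptor in the same way is not something this sketch can confirm.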

From: Anders Bjornerstedt [mailto:[email protected]]
Sent: den 19 augusti 2015 10:54
To: [opensaf:tickets]
Subject: [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

Ok, but then the question simply becomes: why does the healthcheck callback not reach the IMMND, or why does the IMMND reply
not reach the AMFND?

/AndersBj

From: Sirisha Alla [mailto:[email protected]]
Sent: den 19 augusti 2015 10:50
To: [opensaf:tickets]
Subject: [opensaf:tickets] #1291 IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync

This issue is reproduced on changeset 6744. Syslog as follows:
Aug 19 11:54:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO implementer for class 'SaSmfSwBundle' is safSmfService => class extent is safe.
Aug 19 11:54:13 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Assigned 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Aug 19 11:54:13 SLES-64BIT-SLOT1 opensafd: OpenSAF(4.7.M0 - ) services successfully started
Aug 19 11:54:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced dump at node 2010f. New Epoch:27
......
Aug 19 12:00:12 SLES-64BIT-SLOT1 kernel: [ 4223.945761] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO New IMMND process is on STANDBY Controller at 2020f
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Extended intro from node 2020f
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: WA IMMND on controller (not currently coord) requests sync
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Node 2020f request sync sync-pid:5221 epoch:0
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Announce sync, epoch:30
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully announced sync. New ruling epoch:30
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: logtrace: trace enabled to file /var/log/opensaf/osafimmnd, mask=0xffffffff
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafamfd[6044]: NO Node 'PL-3' left the cluster
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went down. Not sending track callback for agents on that node
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Global discard node received for nodeId:2030f pid:16584
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer disconnected 15 <0, 2030f(down)> (MsgQueueService131855)
Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876089] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876098] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.877196] TIPC: Lost contact with <1.1.3>
Aug 19 12:00:46 SLES-64BIT-SLOT1 kernel: [ 4257.206593] TIPC: Established link <1.1.1:eth0-1.1.3:eth0> on network plane A
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: ER SYNC APPARENTLY FAILED status:1
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Epoch set to 30 in ImmModel
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting ABORT_SYNC, epoch:30
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 30 committing with ccbId:100000006/4294967302
Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964128] TIPC: Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964145] TIPC: Lost link <1.1.1:eth0-1.1.3:eth0> on network plane A
Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964157] TIPC: Lost contact with <1.1.3>
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer locally disconnected. Marking it as doomed 11 <316, 2010f> (OpenSafImmPBE)
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: WA Persistent back-end process has apparently died.
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
Aug 19 12:04:30 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
.....
Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO ImmModel::getPbeOi reports missing PbeOi locally => unsafe
Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO SU failover probation timer started (timeout: 1200000000000 ns)
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
In the above logs, is this the reason for IMMND hanging for 3 minutes?
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck on runtime data handling - sending SIGTERM
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received SIG_TERM, closing db handle
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process EXITING...
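
One way to check the "3 minutes" figure against the syslog is to compute the gaps between consecutive entries of a single process. The following Python sketch does that. The function name `find_gaps`, its parameters, and the default threshold are hypothetical, and it assumes each line starts with a classic 15-character syslog timestamp (which carries no year, so one is supplied); it is not part of any OpenSAF tool:

```python
from datetime import datetime


def find_gaps(lines, proc, min_gap_s=60, year=2015):
    """Return (gap_seconds, earlier_line, later_line) for silent periods of one process.

    Only lines containing `proc` are considered; a gap is reported when two
    consecutive such lines are at least `min_gap_s` seconds apart.
    """
    stamped = []
    for line in lines:
        if proc not in line:
            continue
        # The first 15 characters look like "Aug 19 12:01:58"; prepend the assumed year.
        when = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        stamped.append((when, line))

    gaps = []
    for (t0, earlier), (t1, later) in zip(stamped, stamped[1:]):
        delta = (t1 - t0).total_seconds()
        if delta >= min_gap_s:
            gaps.append((delta, earlier, later))
    return gaps


if __name__ == "__main__":
    excerpt = [
        "Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord broadcasting ABORT_SYNC, epoch:30",
        "Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964128] TIPC: Resetting link",
        "Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994 appears stuck",
    ]
    for gap, earlier, later in find_gaps(excerpt, proc="osafimmnd"):
        print(gap, "|", earlier[:15], "->", later[:15])
```

In the excerpt quoted above, the last osafimmnd entry at 12:01:58 and the PBE warning at 12:04:28 are 150 seconds apart, so the silent period is closer to two and a half minutes than three.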
------------------------------------------------------------------------
[tickets:#1291] <http://sourceforge.net/p/opensaf/tickets/1291/> IMM: IMMD healthcheck callback timeout when standby controller rebooted in middle of IMMND sync
Status: not-reproducible
Milestone: never
Created: Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
Last Updated: Wed Aug 19, 2015 08:40 AM UTC
Owner: Neelakanta Reddy
Attachments:

  * immlogs.tar.bz2
    <https://sourceforge.net/p/opensaf/tickets/1291/attachment/immlogs.tar.bz2>
    (6.8 MB; application/x-bzip)
The issue is observed with 4.6 FC changeset 6377. The system is up and running with single pbe and 50k objects. This issue is seen after http://sourceforge.net/p/opensaf/tickets/1290 is observed. IMM application is running on standby controller and immcfg command is run from payload to set CompRestartMax value to 1000. IMMND is killed twice on standby controller leading to #1290.
As a result, standby controller left the cluster in middle of sync, IMMD reported healthcheck callback timeout and the active controller too went for reboot. Following is the syslog of SC-1:
Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for node id 2020f:
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF NodeId = 131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 131343, SupervisionTime = 60
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC: Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost link <1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost contact with <1.1.2>
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not sending track callback for agents on that node
Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left the cluster
Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node in the absence of PLM is outside the scope of OpenSAF
Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC: Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=2197', processed='center(received)=1172', processed='destination(messages)=1172', processed='destination(mailinfo)=0', processed='destination(mailwarn)=0', processed='destination(localmessages)=955', processed='destination(newserr)=0', processed='destination(mailerr)=0', processed='destination(netmgm)=0', processed='destination(warn)=44', processed='destination(console)=13', processed='destination(null)=0', processed='destination(mail)=0', processed='destination(xconsole)=13', processed='destination(firewall)=0', processed='destination(acpid)=0', processed='destination(newscrit)=0', processed='destination(newsnotice)=0', processed='source(src)=1172'
Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on saImmOmSearchNext - aborting
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY FAILED status:1
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE (2484)
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12 in ImmModel
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord broadcasting ABORT_SYNC, epoch:12
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12 committing with ccbId:100000054/4294967380
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover probation timer started (timeout: 1200000000000 ns)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated from 'componentFailover' to 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO 'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:healthCheckcallbackTimeout Recovery is:suFailover
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF NodeId = 131343 EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 131343, SupervisionTime = 60
Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node; timeout=60
syslog, immnd and immd traces of SC-1 attached.
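
When reading these logs it helps to note that node IDs appear in two notations: some messages print hexadecimal (2010f, 2020f, 2030f) and others print decimal (131343, 131599, 131855). A small Python sketch of the conversion follows. The helper names are hypothetical, and the nanosecond conversion simply restates the probation timeout printed by osafamfnd:

```python
def node_id_to_hex(node_id: int) -> str:
    """Render a decimal node ID the way the hexadecimal log messages show it."""
    return format(node_id, "x")


def hex_to_node_id(text: str) -> int:
    """Parse a hexadecimal node ID such as '2020f' into its decimal form."""
    return int(text, 16)


def ns_to_minutes(ns: int) -> float:
    """Convert a timeout in nanoseconds to minutes."""
    return ns / (60 * 10**9)


if __name__ == "__main__":
    for dec in (131343, 131599, 131855):
        print(dec, "->", node_id_to_hex(dec))
    print(ns_to_minutes(1200000000000), "minutes")
```

By this arithmetic, node id 2020f in the osaffmd "Node Down" message and NodeId 131599 in the reboot message are the same number, 131343 corresponds to 2010f, and the SU failover probation timeout of 1200000000000 ns is 20 minutes.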

------------------------------------------------------------------------
Sent from sourceforge.net because you indicated interest in https://sourceforge.net/p/opensaf/tickets/1291/
To unsubscribe from further messages, please visit https://sourceforge.net/auth/subscriptions/
------------------------------------------------------------------------


_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets
