Changeset "6744" is generated today.
So I assume this means you reproduced this today.
The IMMND main poll handling processes in sequence on each descriptor,
so it should not be possible
For traffic on one descriptor to "starve out" a job on another.
/AndersBj
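
For illustration, a minimal sketch of the sequential poll-dispatch pattern referred
to above (not the actual IMMND source; the handler is a stand-in): every descriptor
that polled ready is serviced once per loop iteration, so a burst on one descriptor
can delay, but not indefinitely block, a pending job on another.

#include <poll.h>
#include <stdio.h>

#define NUM_FDS 3

/* Stand-in for the per-descriptor jobs (MDS traffic, AMF callbacks,
 * timers, ...); services one bounded piece of work per call. */
static void handle_fd_event(int fd)
{
    (void)fd;
}

/* Sequential dispatch: every descriptor that is ready is serviced once
 * per loop iteration, so heavy traffic on fds[0] cannot starve a
 * pending job (e.g. an AMF healthcheck callback) on fds[2]. */
static void main_poll_loop(const int fds[NUM_FDS])
{
    struct pollfd pfds[NUM_FDS];

    for (;;) {
        for (int i = 0; i < NUM_FDS; i++) {
            pfds[i].fd = fds[i];
            pfds[i].events = POLLIN;
            pfds[i].revents = 0;
        }

        if (poll(pfds, NUM_FDS, -1) == -1) {
            perror("poll");
            continue;
        }

        for (int i = 0; i < NUM_FDS; i++) {
            if (pfds[i].revents & POLLIN)
                handle_fd_event(pfds[i].fd);
        }
    }
}
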
From: Anders Bjornerstedt [mailto:[email protected]]
Sent: den 19 augusti 2015 10:54
To: [opensaf:tickets]
Subject: [opensaf:tickets] Re: #1291 IMM: IMMD healthcheck callback
timeout when standby controller rebooted in middle of IMMND sync
OK, but then the question simply becomes: why does the healthcheck
callback not reach the IMMND, or why does the IMMND reply
not reach the AMFND?
/AndersBj
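
For reference, a minimal sketch of the two legs of an AMF-invoked healthcheck,
using the standard SA Forum AMF C API (component registration and
saAmfHealthcheckStart are omitted, and the callback body is illustrative):
AMF dispatches saAmfHealthcheckCallback into the component, and the component
must answer with saAmfResponse within the configured duration. A
'healthCheckcallbackTimeout' therefore means one of those two legs did not
complete in time.

#include <saAmf.h>
#include <stdio.h>

static SaAmfHandleT amf_hdl;

/* Leg 1: AMF invokes this callback in the component.
 * Leg 2: the component acknowledges with saAmfResponse().
 * If either leg is delayed past the configured healthcheck duration,
 * AMF reports healthCheckcallbackTimeout and escalates recovery. */
static void healthcheck_cb(SaInvocationT invocation,
                           const SaNameT *comp_name,
                           SaAmfHealthcheckKeyT *key)
{
    (void)comp_name;
    (void)key;
    /* ... verify internal health here ... */
    SaAisErrorT rc = saAmfResponse(amf_hdl, invocation, SA_AIS_OK);
    if (rc != SA_AIS_OK)
        fprintf(stderr, "saAmfResponse failed: %d\n", (int)rc);
}

int main(void)
{
    SaVersionT ver = { 'B', 1, 1 };
    SaAmfCallbacksT cbs = { 0 };
    cbs.saAmfHealthcheckCallback = healthcheck_cb;

    if (saAmfInitialize(&amf_hdl, &cbs, &ver) != SA_AIS_OK)
        return 1;

    /* Component registration and saAmfHealthcheckStart() omitted.
     * In the real main loop the AMF selection object is polled and
     * saAmfDispatch() drives the callback above. */
    saAmfDispatch(amf_hdl, SA_DISPATCH_ALL);
    return 0;
}
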
From: Sirisha Alla [mailto:[email protected]]
Sent: den 19 augusti 2015 10:50
To: [opensaf:tickets]
Subject: [opensaf:tickets] #1291 IMM: IMMD healthcheck callback
timeout when standby controller rebooted in middle of IMMND sync
This issue has been reproduced on changeset 6744. The syslog is as follows:
Aug 19 11:54:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO implementer for
class 'SaSmfSwBundle' is safSmfService => class extent is safe.
Aug 19 11:54:13 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Assigned
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to
'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Aug 19 11:54:13 SLES-64BIT-SLOT1 opensafd: OpenSAF(4.7.M0 - ) services
successfully started
Aug 19 11:54:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully
announced dump at node 2010f. New Epoch:27
......
Aug 19 12:00:12 SLES-64BIT-SLOT1 kernel: [ 4223.945761] TIPC:
Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO New IMMND process
is on STANDBY Controller at 2020f
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Extended intro
from node 2020f
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: WA IMMND on
controller (not currently coord) requests sync
Aug 19 12:00:13 SLES-64BIT-SLOT1 osafimmd[5958]: NO Node 2020f request
sync sync-pid:5221 epoch:0
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Announce sync,
epoch:30
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO SERVER STATE:
IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE->
IMM_NODE_R_AVAILABLE
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmd[5958]: NO Successfully
announced sync. New ruling epoch:30
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: logtrace: trace enabled
to file /var/log/opensaf/osafimmnd, mask=0xffffffff
Aug 19 12:00:14 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafamfd[6044]: NO Node 'PL-3' left
the cluster
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went
down. Not sending track callback for agents on that node
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafclmd[6025]: NO Node 131855 went
down. Not sending track callback for agents on that node
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Global discard
node received for nodeId:2030f pid:16584
Aug 19 12:00:15 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer
disconnected 15 <0, 2030f(down)> (MsgQueueService131855)
Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876089] TIPC:
Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.876098] TIPC: Lost
link <1.1.1:eth0-1.1.3:eth0> on network plane A
Aug 19 12:00:20 SLES-64BIT-SLOT1 kernel: [ 4231.877196] TIPC: Lost
contact with <1.1.3>
Aug 19 12:00:46 SLES-64BIT-SLOT1 kernel: [ 4257.206593] TIPC:
Established link <1.1.1:eth0-1.1.3:eth0> on network plane A
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN
on saImmOmSearchNext - aborting
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: ER SYNC APPARENTLY
FAILED status:1
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO -SERVER STATE:
IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO NODE STATE->
IMM_NODE_FULLY_AVAILABLE (2484)
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Epoch set to 30
in ImmModel
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord
broadcasting ABORT_SYNC, epoch:30
Aug 19 12:01:58 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 30
committing with ccbId:100000006/4294967302
Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964128] TIPC:
Resetting link <1.1.1:eth0-1.1.3:eth0>, peer not responding
Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964145] TIPC: Lost
link <1.1.1:eth0-1.1.3:eth0> on network plane A
Aug 19 12:03:50 SLES-64BIT-SLOT1 kernel: [ 4441.964157] TIPC: Lost
contact with <1.1.3>
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994
appears stuck on runtime data handling - sending SIGTERM
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received
SIG_TERM, closing db handle
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process
EXITING...
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Implementer
locally disconnected. Marking it as doomed 11 <316, 2010f> (OpenSafImmPBE)
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: WA Persistent
back-end process has apparently died.
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord
broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO
ImmModel::getPbeOi reports missing PbeOi locally => unsafe
Aug 19 12:04:29 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord
broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
Aug 19 12:04:30 SLES-64BIT-SLOT1 osafimmnd[5969]: NO
ImmModel::getPbeOi reports missing PbeOi locally => unsafe
.....
Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO
ImmModel::getPbeOi reports missing PbeOi locally => unsafe
Aug 19 12:05:13 SLES-64BIT-SLOT1 osafimmnd[5969]: NO Coord
broadcasting PBE_PRTO_PURGE_MUTATIONS, epoch:30
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO SU failover
probation timer started (timeout: 1200000000000 ns)
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO Performing
failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action
escalated from 'componentFailover' to 'suFailover'
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: NO
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: ER
safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
to:healthCheckcallbackTimeout Recovery is:suFailover
Aug 19 12:05:14 SLES-64BIT-SLOT1 osafamfnd[6054]: Rebooting OpenSAF
NodeId = 131343 EE Name = , Reason: Component faulted: recovery is
node failfast, OwnNodeId = 131343, SupervisionTime = 60
In the above logs, is the following the reason for the IMMND hanging for 3 minutes?
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmnd[5969]: WA PBE process 5994
appears stuck on runtime data handling - sending SIGTERM
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: NO IMM PBE received
SIG_TERM, closing db handle
Aug 19 12:04:28 SLES-64BIT-SLOT1 osafimmpbed: IN IMM PBE process
EXITING...
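
The "Too many TRY_AGAIN on saImmOmSearchNext - aborting" line earlier in the log
is the sync source giving up after a bounded number of SA_AIS_ERR_TRY_AGAIN
replies from the local IMMND. A rough sketch of that retry pattern (the retry
bound and back-off below are illustrative, not the values osafimmloadd actually
uses):

#include <saImmOm.h>
#include <unistd.h>
#include <stdio.h>

/* Iterate a search handle, tolerating a bounded number of TRY_AGAIN
 * replies. If the IMMND keeps answering TRY_AGAIN (for example because
 * the PBE is blocked), the iteration gives up and the sync is aborted. */
#define MAX_TRY_AGAIN 100

SaAisErrorT search_all(SaImmSearchHandleT search_handle)
{
    SaNameT object_name;
    SaImmAttrValuesT_2 **attributes;
    unsigned int try_again = 0;

    for (;;) {
        SaAisErrorT rc = saImmOmSearchNext_2(search_handle,
                                             &object_name, &attributes);
        if (rc == SA_AIS_ERR_TRY_AGAIN) {
            if (++try_again > MAX_TRY_AGAIN) {
                fprintf(stderr, "Too many TRY_AGAIN - aborting\n");
                return rc;
            }
            usleep(100 * 1000); /* back off briefly, then retry */
            continue;
        }
        if (rc == SA_AIS_ERR_NOT_EXIST) /* no more objects */
            return SA_AIS_OK;
        if (rc != SA_AIS_OK)
            return rc;

        try_again = 0;
        /* ... marshal object_name/attributes into the sync message ... */
    }
}
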
------------------------------------------------------------------------
[tickets:#1291]
http://sourceforge.net/p/opensaf/tickets/1291/
IMM: IMMD healthcheck callback timeout when standby controller
rebooted in middle of IMMND sync
Status: not-reproducible
Milestone: never
Created: Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
Last Updated: Wed Aug 19, 2015 08:40 AM UTC
Owner: Neelakanta Reddy
Attachments:
* immlogs.tar.bz2 <https://sourceforge.net/p/opensaf/tickets/1291/attachment/immlogs.tar.bz2>
(6.8 MB; application/x-bzip)
The issue is observed with 4.6 FC changeset 6377. The system is up and
running with a single PBE and 50k objects. This issue is seen after
http://sourceforge.net/p/opensaf/tickets/1290 is observed. An IMM
application is running on the standby controller, and the immcfg command is run
from a payload to set the CompRestartMax value to 1000. The IMMND is killed
twice on the standby controller, leading to #1290.
As a result, the standby controller left the cluster in the middle of the sync,
the IMMD reported a healthcheck callback timeout, and the active controller
also went for a reboot. The following is the syslog of SC-1:
Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for
node id 2020f:
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF
NodeId = 131599 EE Name = , Reason: Received Node Down for peer
controller, OwnNodeId = 131343, SupervisionTime = 60
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC:
Resetting link <1.1.1:eth0-1.1.2:eth0>, peer not responding
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost
link <1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost
contact with <1.1.2>
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
down. Not sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went
down. Not sending track callback for agents on that node
Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left
the cluster
Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node
in the absence of PLM is outside the scope of OpenSAF
Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC:
Established link <1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics;
dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0',
processed='center(queued)=2197', processed='center(received)=1172',
processed='destination(messages)=1172',
processed='destination(mailinfo)=0',
processed='destination(mailwarn)=0',
processed='destination(localmessages)=955',
processed='destination(newserr)=0',
processed='destination(mailerr)=0', processed='destination(netmgm)=0',
processed='destination(warn)=44', processed='destination(console)=13',
processed='destination(null)=0', processed='destination(mail)=0',
processed='destination(xconsole)=13',
processed='destination(firewall)=0', processed='destination(acpid)=0',
processed='destination(newscrit)=0',
processed='destination(newsnotice)=0', processed='source(src)=1172'
Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN
on saImmOmSearchNext - aborting
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY
FAILED status:1
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE:
IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE->
IMM_NODE_FULLY_AVAILABLE (2484)
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12
in ImmModel
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord
broadcasting ABORT_SYNC, epoch:12
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12
committing with ccbId:100000054/4294967380
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover
probation timer started (timeout: 1200000000000 ns)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing
failover of 'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action
escalated from 'componentFailover' to 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER
safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
to:healthCheckcallbackTimeout Recovery is:suFailover
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF
NodeId = 131343 EE Name = , Reason: Component faulted: recovery is
node failfast, OwnNodeId = 131343, SupervisionTime = 60
Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node;
timeout=60
syslog, immnd and immd traces of SC-1 attached.
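
For context on the immcfg step in the scenario above: an attribute change made
with immcfg is applied as an IMM CCB, and the CCB commit goes through the PBE
when persistence is enabled, which is why a blocked PBE can stall such
operations. A rough sketch of the equivalent OM-API call sequence; the DN and
the attribute name saAmfSGCompRestartMax are assumptions, since the ticket only
says "CompRestartMax":

#include <saImmOm.h>
#include <string.h>

/* Sketch: modify one SaUint32T attribute via a CCB, roughly what
 * "immcfg -a <attr>=1000 <dn>" does through the IMM OM API.
 * The DN and attribute name below are assumptions for illustration. */
SaAisErrorT set_comp_restart_max(void)
{
    SaVersionT ver = { 'A', 2, 1 };
    SaImmHandleT imm;
    SaImmAdminOwnerHandleT owner;
    SaImmCcbHandleT ccb;
    SaAisErrorT rc;

    SaNameT dn;
    const char *dn_str = "safSg=2N,safApp=OpenSAF"; /* assumed DN */
    dn.length = strlen(dn_str);
    memcpy(dn.value, dn_str, dn.length);
    const SaNameT *objs[] = { &dn, NULL };

    SaUint32T new_val = 1000;
    SaImmAttrValueT vals[] = { &new_val };
    SaImmAttrModificationT_2 mod = {
        .modType = SA_IMM_ATTR_VALUES_REPLACE,
        .modAttr = { .attrName = (char *)"saAmfSGCompRestartMax",
                     .attrValueType = SA_IMM_ATTR_SAUINT32T,
                     .attrValuesNumber = 1,
                     .attrValues = vals }
    };
    const SaImmAttrModificationT_2 *mods[] = { &mod, NULL };

    if ((rc = saImmOmInitialize(&imm, NULL, &ver)) != SA_AIS_OK) return rc;
    if ((rc = saImmOmAdminOwnerInitialize(imm, (char *)"sketch", SA_TRUE,
                                          &owner)) != SA_AIS_OK) return rc;
    if ((rc = saImmOmAdminOwnerSet(owner, objs, SA_IMM_ONE)) != SA_AIS_OK)
        return rc;
    if ((rc = saImmOmCcbInitialize(owner, 0, &ccb)) != SA_AIS_OK) return rc;
    if ((rc = saImmOmCcbObjectModify_2(ccb, &dn, mods)) != SA_AIS_OK)
        return rc;
    rc = saImmOmCcbApply(ccb); /* commit; goes through the PBE if enabled */

    saImmOmCcbFinalize(ccb);
    saImmOmAdminOwnerFinalize(owner);
    saImmOmFinalize(imm);
    return rc;
}
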
------------------------------------------------------------------------------
_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets