Hi,
Having the sync always wait for two seconds causes a major performance reduction
for sync on some configurations; because of this, that patch was reverted in #1188.
In #1188, the retry delay for each TRY_AGAIN is increased exponentially from 10 msec
to 0.5 sec, and the maximum number of TRY_AGAIN retries is 320.
In the present problem scenario there is no link loss, so the TIPC tolerance
time may not be a factor; instead the MDS send/receive buffers got filled because of
multicast.
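
Below is a minimal sketch of that retry policy, for illustration only. It is not the
actual osafimmloadd code; the helper name search_next_with_retry and the constant
names are made up here. It shows the behaviour described above: the delay after each
SA_AIS_ERR_TRY_AGAIN from saImmOmSearchNext_2 starts at 10 ms, doubles up to a 0.5 s
cap, and the loop gives up after 320 TRY_AGAIN replies, which is what produces the
"Too many TRY_AGAIN on saImmOmSearchNext - aborting" message in the syslog below.

#include <unistd.h>
#include <saImmOm.h>

/* Illustrative constants matching the policy described above. */
#define MAX_TRY_AGAIN 320
#define INITIAL_DELAY_US (10 * 1000)   /* 10 ms  */
#define MAX_DELAY_US (500 * 1000)      /* 0.5 s  */

static SaAisErrorT search_next_with_retry(SaImmSearchHandleT searchHandle,
                                          SaNameT *objectName,
                                          SaImmAttrValuesT_2 ***attributes)
{
	useconds_t delay = INITIAL_DELAY_US;
	int try_again_count = 0;
	SaAisErrorT rc;

	while ((rc = saImmOmSearchNext_2(searchHandle, objectName,
	                                 attributes)) == SA_AIS_ERR_TRY_AGAIN) {
		if (++try_again_count > MAX_TRY_AGAIN)
			break; /* gives up: "Too many TRY_AGAIN ... - aborting" */
		usleep(delay);
		delay *= 2;               /* exponential back-off */
		if (delay > MAX_DELAY_US)
			delay = MAX_DELAY_US; /* capped at 0.5 s */
	}
	return rc;
}
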
The following adjustments were made as part of changeset 5578:
+/* Adjust to 90% of MDS_DIRECT_BUF_MAXSIZE */
+#define IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE ((MDS_DIRECT_BUF_MAXSIZE / 100) * 90)
+#define IMMSV_MAX_OBJS_IN_SYNCBATCH (IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE / 10)
MAX_SYNC_BATCH_SIZE is fixed, but different machines may or may not have larger
send and receive buffer sizes, so MAX_SYNC_BATCH_SIZE needs to be tuned more
finely when multicast is used.
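
As a rough worked example (purely illustrative; the real MDS_DIRECT_BUF_MAXSIZE is
the value defined in the MDS headers and may differ): if MDS_DIRECT_BUF_MAXSIZE were
65536 bytes, the macros above would evaluate, with integer arithmetic, to

    IMMSV_DEFAULT_MAX_SYNC_BATCH_SIZE = (65536 / 100) * 90 = 58950 bytes
    IMMSV_MAX_OBJS_IN_SYNCBATCH       = 58950 / 10         = 5895

so a single sync batch stays safely below one MDS direct-send buffer.
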
/Neel.
---
** [tickets:#1291] MDS: IMMD healthcheck callback timeout when standby
controller rebooted in middle of IMMND sync**
**Status:** unassigned
**Milestone:** 4.6.RC1
**Created:** Mon Mar 30, 2015 07:21 AM UTC by Sirisha Alla
**Last Updated:** Tue Mar 31, 2015 04:03 AM UTC
**Owner:** nobody
The issue is observed with 4.6 FC changeset 6377. The system is up and running
with a single PBE and 50k objects. This issue is seen after
http://sourceforge.net/p/opensaf/tickets/1290 occurs. An IMM application is
running on the standby controller and the immcfg command is run from a payload to
set the CompRestartMax value to 1000. IMMND is killed twice on the standby
controller, leading to #1290.
As a result, the standby controller left the cluster in the middle of the sync,
IMMD reported a healthcheck callback timeout, and the active controller also went
for a reboot. Following is the syslog of SC-1:
Mar 26 14:58:17 SLES-64BIT-SLOT1 osafimmloadd: NO Sync starting
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Node Down event for node id
2020f:
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: NO Current role: ACTIVE
Mar 26 14:58:28 SLES-64BIT-SLOT1 osaffmd[9529]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId =
131343, SupervisionTime = 60
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412080] TIPC: Resetting link
<1.1.1:eth0-1.1.2:eth0>, peer not responding
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.412089] TIPC: Lost link
<1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 14:58:28 SLES-64BIT-SLOT1 kernel: [15200.413191] TIPC: Lost contact with
<1.1.2>
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not
sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not
sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not
sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not
sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not
sending track callback for agents on that node
Mar 26 14:58:28 SLES-64BIT-SLOT1 osafclmd[9609]: NO Node 131599 went down. Not
sending track callback for agents on that node
Mar 26 14:58:30 SLES-64BIT-SLOT1 osafamfd[9628]: NO Node 'SC-2' left the cluster
Mar 26 14:58:30 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node in the
absence of PLM is outside the scope of OpenSAF
Mar 26 14:58:54 SLES-64BIT-SLOT1 kernel: [15226.674333] TIPC: Established link
<1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 26 15:00:02 SLES-64BIT-SLOT1 syslog-ng[3261]: Log statistics;
dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0',
processed='center(queued)=2197', processed='center(received)=1172',
processed='destination(messages)=1172', processed='destination(mailinfo)=0',
processed='destination(mailwarn)=0',
processed='destination(localmessages)=955', processed='destination(newserr)=0',
processed='destination(mailerr)=0', processed='destination(netmgm)=0',
processed='destination(warn)=44', processed='destination(console)=13',
processed='destination(null)=0', processed='destination(mail)=0',
processed='destination(xconsole)=13', processed='destination(firewall)=0',
processed='destination(acpid)=0', processed='destination(newscrit)=0',
processed='destination(newsnotice)=0', processed='source(src)=1172'
Mar 26 15:00:07 SLES-64BIT-SLOT1 osafimmloadd: ER Too many TRY_AGAIN on
saImmOmSearchNext - aborting
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: ER SYNC APPARENTLY FAILED
status:1
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO -SERVER STATE:
IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO NODE STATE->
IMM_NODE_FULLY_AVAILABLE (2484)
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Epoch set to 12 in ImmModel
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmnd[9549]: NO Coord broadcasting
ABORT_SYNC, epoch:12
Mar 26 15:00:08 SLES-64BIT-SLOT1 osafimmpbed: NO Update epoch 12 committing
with ccbId:100000054/4294967380
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO SU failover probation
timer started (timeout: 1200000000000 ns)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO Performing failover of
'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated
from 'componentFailover' to 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: NO
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: ER
safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
to:healthCheckcallbackTimeout Recovery is:suFailover
Mar 26 15:01:34 SLES-64BIT-SLOT1 osafamfnd[9638]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: Component faulted: recovery is node failfast,
OwnNodeId = 131343, SupervisionTime = 60
Mar 26 15:01:34 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting local node;
timeout=60
Syslog, immnd and immd traces of SC-1 are attached.