- **status**: assigned --> unassigned
- **assigned_to**: Anders Bjornerstedt --> nobody
- **Component**: imm --> amf
- **Part**: - --> d
- **Comment**:
This is what the syslog on SLOT2 says at the time of this incident:
Mar 27 12:56:27 SLES-64BIT-SLOT2 osafamfd[5561]: NO ROLE SWITCH Active -->
Quiesced
(AndersBj: That was the last syslog msg from the AMFD.)
Mar 27 12:56:28 SLES-64BIT-SLOT2 osafrded[5477]: NO RDE role set to QUIESCED
Mar 27 12:56:29 SLES-64BIT-SLOT2 osafimmnd[5506]: NO Implementer disconnected
283 <10, 2020f> (safAmfService)
(AndersBj: IMMND reports that the local AMFD disconnects as main OI.
Then there is almost a minute of total syslog silence from all
services at SLOT2 until..)
Mar 27 12:57:24 SLES-64BIT-SLOT2 osafamfnd[5571]: ER AMF director heart beat
timeout, generating core for amfd
---------------------------
(AndersBj: So the syslogs do appear to show that the AMFD is stuck
on something during the time between 12:56:29 and 12:57:24.
Looking at the AMFD traces for this time we see only:)
Mar 27 12:56:28.584224 osafamfd [5561:imm.cc:0235] >> peek
Mar 27 12:56:28.584228 osafamfd [5561:imm.cc:0244] << peek
Mar 27 12:56:29.350611 osafamfd [5561:mds.cc:0491] >> avd_mds_qsd_ack_evt
Mar 27 12:56:29.350748 osafamfd [5561:role.cc:0569] >> avd_mds_qsd_role_evh
Mar 27 12:56:29.356363 osafamfd [5561:imm.cc:0307] >> empty
Mar 27 12:56:29.356363 osafamfd [5561:imm.cc:0263] >> dequeue
Mar 27 12:56:29.356363 osafamfd [5561:imm.cc:0273] << dequeue
Mar 27 12:56:29.356363 osafamfd [5561:imm.cc:0313] << empty
Mar 27 12:56:29.356364 osafamfd [5561:imm.cc:1222] >> avd_imm_applier_set
Mar 27 12:56:34.850807 osafamfd [5561:timer.cc:0148] >> avd_tmr_exp
Mar 27 12:56:34.851057 osafamfd [5561:timer.cc:0169] << avd_tmr_exp :
(AndersBj: So yes, the AMFD seems to get stuck for a whole minute
inside avd_imm_applier_set.)
There are no IMMND traces attached, so we cannot see if the IMMND
is hung. That is unlikely, though, since the IMMND has only one
application thread and invokes practically no synchronous calls
except towards MDS. Traces from IMMND would have shown whether the IMMND
was blocked or looping. Note here that the IMMNDs are totally
oblivious to si-swap and there is no switch of IMMND coord in relation
to si-swap, so the IMMNDs are unlikely to get triggered into any
unusual behavior just because there is an si-swap; besides, there have
already been N successful si-swaps just before this one.
The avd_imm_applier_set() does:
    if ((rc = immutil_saImmOiImplementerSet(avd_cb->immOiHandle,
            applier_name)) != SA_AIS_OK) {
        LOG_ER("saImmOiImplementerSet failed %u", rc);
        return rc;
    }
No error is logged by AMFD, so we know the AMFD is somehow blocked
inside this immutil call. The implementer-set apparently does not reach
the IMMND. Implementer-set is not blocked by sync, and there is no
sync ongoing anyway.
So my best guess is that the MDS message from AMFD to IMMND does not
get delivered.
But the main mystery is this: all IMM APIs have a 10 second
client timeout by default, so unless AMFD has increased
IMMA_SYNCR_TIMEOUT to 60+ seconds there would have been an
ERR_TIMEOUT generated purely on the client IMMA side.
And that error would have been logged by AMFD.
There is also the immutil layer in between here. If there
were a TRY_AGAIN and you had configured immutil so that it did
internal retries for 60 seconds, then that could explain it. But I
have to assume that you have just the regular 10 sec timeout in AMFD.
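Just to illustrate the arithmetic, here is a minimal standalone sketch of
that scenario. This is NOT the real immutil code: one_attempt, the error
code stand-ins and the 60 second retry budget are my own assumptions. It
only shows how per-call 10 second timeouts combined with internal
TRY_AGAIN retries could keep the caller silent, with nothing logged, for
a whole minute:

    #include <stdio.h>

    /* Stand-in error codes (values as in the SAF AIS spec). */
    typedef enum {
        SA_AIS_OK = 1,
        SA_AIS_ERR_TIMEOUT = 5,
        SA_AIS_ERR_TRY_AGAIN = 6
    } AisError;

    /* Pretend every synchronous attempt ends in TRY_AGAIN (or a client
     * side timeout) only after the full 10 second client timeout. */
    static AisError one_attempt(void) { return SA_AIS_ERR_TRY_AGAIN; }

    int main(void) {
        const int per_call_timeout_s = 10; /* IMMA default client timeout */
        const int retry_budget_s = 60;     /* hypothetical retry budget */
        int waited_s = 0;
        AisError rc;

        do {
            rc = one_attempt();
            waited_s += per_call_timeout_s; /* each attempt costs ~10 s */
        } while (rc == SA_AIS_ERR_TRY_AGAIN && waited_s < retry_budget_s);

        /* The caller (AMFD in this scenario) gets control back, and can
         * log an error, only here: 6 x 10 s = ~60 s of apparent hang. */
        printf("caller unblocked after ~%d s, rc=%d\n", waited_s, rc);
        return 0;
    }

But again, that only works as an explanation if the retry budget really
is configured that long, which I do not expect in a default AMFD.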
So even IF the IMMND is blocked or there is an MDS message lost or..
>> WHY does the immutil call NOT return after 10 sec with ERR_TIMEOUT?
You could try to reproduce with trace on for IMMND, but I doubt this
will help, because the timeout issue seems to be the dominating question
here.
The timeout that never fired should have been generated by the MDS library.
Is si-swap doing any state changes towards MDS that could go wrong?
I was contemplating setting the component to MDS, but decided to set it back
to AMF first. If AMF has default settings for IMMA_SYNCR_TIMEOUT, and
it can be seen in the MDS traces that the request from AMFD is generated
yet the 10 sec timeout never happens, then the problem would seem to
be in MDS. MDS is a library, though, so it could also be a heap
corruption issue in the AMFD.
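For reference, a synchronous call like this is normally bounded by a
client-side wait with a deadline, independent of whether the request or
the reply ever makes it across. The sketch below is purely conceptual
(plain pthreads; wait_for_reply and reply_arrived are made-up names, not
the MDS or IMMA code) and only illustrates why even a lost message should
still surface as ERR_TIMEOUT at the caller rather than an indefinite hang:

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t reply_arrived = PTHREAD_COND_INITIALIZER;
    static int have_reply = 0; /* would be set by the receive path */

    /* Wait for the reply, but never longer than timeout_sec.
     * Returns -1 on timeout (which upstream would be mapped to
     * ERR_TIMEOUT), 0 if the reply arrived in time. */
    static int wait_for_reply(int timeout_sec) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += timeout_sec;

        int rc = 0;
        pthread_mutex_lock(&lock);
        while (!have_reply && rc == 0)
            rc = pthread_cond_timedwait(&reply_arrived, &lock, &deadline);
        pthread_mutex_unlock(&lock);

        return (rc == ETIMEDOUT) ? -1 : 0;
    }

    int main(void) {
        /* Nobody ever delivers a reply here, so this returns after
         * ~10 s instead of hanging forever. */
        printf("wait_for_reply -> %d\n", wait_for_reply(10));
        return 0;
    }

If the actual send/wait path in the library skips or mishandles that
deadline in some state reached during si-swap, it would match exactly the
silent one-minute hang seen here.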
---
** [tickets:#825] Quiesced controller goes for reboot and fails to join the
cluster**
**Status:** unassigned
**Milestone:** 4.3.3
**Created:** Thu Mar 27, 2014 10:34 AM UTC by Sirisha Alla
**Last Updated:** Tue Aug 26, 2014 07:24 AM UTC
**Owner:** nobody
The issue is seen on the 4.4RC2 tag code base on 4-node SLES VMs. Single PBE is
enabled. IMM tests are running during continuous switchovers.
Syslog of SLOT2(SC-2):
Mar 27 12:56:27 SLES-64BIT-SLOT2 osafamfd[5561]: NO safSi=SC-2N,safApp=OpenSAF
Swap done
Mar 27 12:56:27 SLES-64BIT-SLOT2 osafamfd[5561]: NO Controller switch over
initiated
Mar 27 12:56:27 SLES-64BIT-SLOT2 osafamfd[5561]: NO ROLE SWITCH Active -->
Quiesced
Mar 27 12:56:28 SLES-64BIT-SLOT2 osafrded[5477]: NO RDE role set to QUIESCED
Mar 27 12:56:29 SLES-64BIT-SLOT2 osafimmnd[5506]: NO Implementer disconnected
283 <10, 2020f> (safAmfService)
Mar 27 12:57:24 SLES-64BIT-SLOT2 osafamfnd[5571]: ER AMF director heart beat
timeout, generating core for amfd
Mar 27 12:57:25 SLES-64BIT-SLOT2 osafamfnd[5571]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: AMF director heart beat timeout, OwnNodeId = 131599,
SupervisionTime = 60
Mar 27 12:57:25 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node;
timeout=60
Mar 27 12:57:29 SLES-64BIT-SLOT2 kernel: [ 6762.000085] md: stopping all md
devices.
Mar 27 12:57:29 SLES-64BIT-SLOT2 kernel: [ 6763.000007] sd 0:0:0:0: [sda]
Synchronizing SCSI cache
Mar 27 12:57:30 SLES-64BIT-SLOT2 kernel: [ 6764.001595] ohci_hcd 0000:00:06.0:
PCI INT A disabled
Syslog of SLOT1(SC-1):
Mar 27 12:56:27 SLES-64BIT-SLOT1 osafamfnd[7123]: NO Assigned
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Mar 27 12:56:27 SLES-64BIT-SLOT1 osafimmnd[7051]: NO Implementer disconnected
281 <0, 2020f> (@OpenSafImmReplicatorB)
Mar 27 12:56:29 SLES-64BIT-SLOT1 osafimmnd[7051]: NO Implementer disconnected
283 <0, 2020f> (safAmfService)
Mar 27 12:57:31 SLES-64BIT-SLOT1 osaffmd[7031]: NO Node Down event for node id
2020f:
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafclmd[7090]: NO proc_initialize_msg: send
failed. dest:2020f57f0c01c
Mar 27 12:57:31 SLES-64BIT-SLOT1 osaffmd[7031]: NO Current role: STANDBY
Mar 27 12:57:31 SLES-64BIT-SLOT1 osaffmd[7031]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId =
131343, SupervisionTime = 60
Mar 27 12:57:31 SLES-64BIT-SLOT1 kernel: [ 6814.800090] TIPC: Resetting link
<1.1.1:eth0-1.1.2:eth0>, peer not responding
Mar 27 12:57:31 SLES-64BIT-SLOT1 kernel: [ 6814.800100] TIPC: Lost link
<1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 27 12:57:31 SLES-64BIT-SLOT1 kernel: [ 6814.800105] TIPC: Lost contact with
<1.1.2>
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafimmnd[7051]: NO Global discard node
received for nodeId:2020f pid:5506
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafimmnd[7051]: NO Implementer disconnected
14 <0, 2020f(down)> (MsgQueueService131599)
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafimmnd[7051]: NO Implementer disconnected
285 <0, 2020f(down)> (@applier2testMA_verifyObjAbortCallbackNode_69_101)
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafimmnd[7051]: WA Detected crash at node
2020f, abort ccbId 38
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafimmnd[7051]: NO Ccb 38 ABORTED (<released>)
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafimmd[7041]: WA IMMD lost contact with peer
IMMD (NCSMDS_RED_DOWN)
Mar 27 12:57:31 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node in the
absence of PLM is outside the scope of OpenSAF
Mar 27 12:57:31 SLES-64BIT-SLOT1 osaffmd[7031]: NO Controller Failover: Setting
role to ACTIVE
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafrded[7022]: NO RDE role set to ACTIVE
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafclmd[7090]: NO ACTIVE request
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafamfd[7113]: NO FAILOVER StandBy --> Active
After this, SLOT2 (SC-2) does not join the cluster.
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafclmna[4133]: Started
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafclmna[4133]: NO
safNode=SC-2,safCluster=myClmCluster Joined cluster, nodeid=2020f
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafamfd[4142]: Started
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafamfd[4142]: WA configuration validation
error: Required attribute saAmfCtDefQuiescingCompleteTimeout not configured for
'safVersion=4.0.0,safCompType=Comp_2nApp_2n_1_1'
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafimmnd[4087]: NO Implementer (applier)
connected: 470 (@safAmfService2020f) <10, 2020f>
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafamfnd[4152]: Started
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafimmnd[4087]: NO Implementer (applier)
connected: 471 (@OpenSafImmReplicatorB) <4, 2020f>
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafntfimcnd[4116]: NO Started
Mar 27 14:56:26 SLES-64BIT-SLOT2 osafamfd[4142]: NO Cold sync complete!
Mar 27 14:56:28 SLES-64BIT-SLOT2 osafimmnd[4087]: NO PBE-OI established on
other SC. Dumping incrementally to file imm.db
......
......
Mar 27 15:28:23 SLES-64BIT-SLOT2 opensafd[4026]: ER Timed-out for response from
AMFND
Mar 27 15:28:23 SLES-64BIT-SLOT2 opensafd[4026]: ER
Mar 27 15:28:23 SLES-64BIT-SLOT2 opensafd[4026]: ER Going for recovery
Mar 27 15:28:23 SLES-64BIT-SLOT2 osafamfd[4142]: exiting for shutdown
Mar 27 15:28:23 SLES-64BIT-SLOT2 osafamfnd[4152]: ER AMF director unexpectedly
crashed
Mar 27 15:28:23 SLES-64BIT-SLOT2 osafamfnd[4152]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest)
received, OwnNodeId = 131599, SupervisionTime = 60
Mar 27 15:28:23 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node;
timeout=60
Mar 27 15:28:23 SLES-64BIT-SLOT2 osafimmnd[4087]: NO Implementer locally
disconnected. Marking it as doomed 470 <10, 2020f> (@safAmfService2020f)
Mar 27 15:28:23 SLES-64BIT-SLOT2 osafimmnd[4087]: NO Implementer disconnected
470 <10, 2020f> (@safAmfService2020f)
The node joins the cluster only after the cluster is reset.
Traces of amfd and amfnd are attached.