- **Milestone**: 4.4.2 --> 4.7-Tentative
---
**[tickets:#825] Quiesced controller goes for reboot and fails to join the cluster**
**Status:** unassigned
**Milestone:** 4.7-Tentative
**Created:** Thu Mar 27, 2014 10:34 AM UTC by Sirisha Alla
**Last Updated:** Thu Jan 08, 2015 02:12 AM UTC
**Owner:** nobody
The issue was seen on the 4.4RC2 tag code base on a 4-node cluster of SLES VMs.
Single PBE was enabled. IMM tests were running during continuous switchovers.
Syslog of SLOT2(SC-2):
Mar 27 12:56:27 SLES-64BIT-SLOT2 osafamfd[5561]: NO safSi=SC-2N,safApp=OpenSAF
Swap done
Mar 27 12:56:27 SLES-64BIT-SLOT2 osafamfd[5561]: NO Controller switch over
initiated
Mar 27 12:56:27 SLES-64BIT-SLOT2 osafamfd[5561]: NO ROLE SWITCH Active -->
Quiesced
Mar 27 12:56:28 SLES-64BIT-SLOT2 osafrded[5477]: NO RDE role set to QUIESCED
Mar 27 12:56:29 SLES-64BIT-SLOT2 osafimmnd[5506]: NO Implementer disconnected
283 <10, 2020f> (safAmfService)
Mar 27 12:57:24 SLES-64BIT-SLOT2 osafamfnd[5571]: ER AMF director heart beat
timeout, generating core for amfd
Mar 27 12:57:25 SLES-64BIT-SLOT2 osafamfnd[5571]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: AMF director heart beat timeout, OwnNodeId = 131599,
SupervisionTime = 60
Mar 27 12:57:25 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node;
timeout=60
Mar 27 12:57:29 SLES-64BIT-SLOT2 kernel: [ 6762.000085] md: stopping all md
devices.
Mar 27 12:57:29 SLES-64BIT-SLOT2 kernel: [ 6763.000007] sd 0:0:0:0: [sda]
Synchronizing SCSI cache
Mar 27 12:57:30 SLES-64BIT-SLOT2 kernel: [ 6764.001595] ohci_hcd 0000:00:06.0:
PCI INT A disabled
Syslog of SLOT1(SC-1):
Mar 27 12:56:27 SLES-64BIT-SLOT1 osafamfnd[7123]: NO Assigned
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Mar 27 12:56:27 SLES-64BIT-SLOT1 osafimmnd[7051]: NO Implementer disconnected
281 <0, 2020f> (@OpenSafImmReplicatorB)
Mar 27 12:56:29 SLES-64BIT-SLOT1 osafimmnd[7051]: NO Implementer disconnected
283 <0, 2020f> (safAmfService)
Mar 27 12:57:31 SLES-64BIT-SLOT1 osaffmd[7031]: NO Node Down event for node id
2020f:
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafclmd[7090]: NO proc_initialize_msg: send
failed. dest:2020f57f0c01c
Mar 27 12:57:31 SLES-64BIT-SLOT1 osaffmd[7031]: NO Current role: STANDBY
Mar 27 12:57:31 SLES-64BIT-SLOT1 osaffmd[7031]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId =
131343, SupervisionTime = 60
Mar 27 12:57:31 SLES-64BIT-SLOT1 kernel: [ 6814.800090] TIPC: Resetting link
<1.1.1:eth0-1.1.2:eth0>, peer not responding
Mar 27 12:57:31 SLES-64BIT-SLOT1 kernel: [ 6814.800100] TIPC: Lost link
<1.1.1:eth0-1.1.2:eth0> on network plane A
Mar 27 12:57:31 SLES-64BIT-SLOT1 kernel: [ 6814.800105] TIPC: Lost contact with
<1.1.2>
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafimmnd[7051]: NO Global discard node
received for nodeId:2020f pid:5506
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafimmnd[7051]: NO Implementer disconnected
14 <0, 2020f(down)> (MsgQueueService131599)
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafimmnd[7051]: NO Implementer disconnected
285 <0, 2020f(down)> (@applier2testMA_verifyObjAbortCallbackNode_69_101)
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafimmnd[7051]: WA Detected crash at node
2020f, abort ccbId 38
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafimmnd[7051]: NO Ccb 38 ABORTED (<released>)
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafimmd[7041]: WA IMMD lost contact with peer
IMMD (NCSMDS_RED_DOWN)
Mar 27 12:57:31 SLES-64BIT-SLOT1 opensaf_reboot: Rebooting remote node in the
absence of PLM is outside the scope of OpenSAF
Mar 27 12:57:31 SLES-64BIT-SLOT1 osaffmd[7031]: NO Controller Failover: Setting
role to ACTIVE
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafrded[7022]: NO RDE role set to ACTIVE
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafclmd[7090]: NO ACTIVE request
Mar 27 12:57:31 SLES-64BIT-SLOT1 osafamfd[7113]: NO FAILOVER StandBy --> Active
After this, SLOT2 (SC-2) does not rejoin the cluster:
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafclmna[4133]: Started
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafclmna[4133]: NO
safNode=SC-2,safCluster=myClmCluster Joined cluster, nodeid=2020f
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafamfd[4142]: Started
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafamfd[4142]: WA configuration validation
error: Required attribute saAmfCtDefQuiescingCompleteTimeout not configured for
'safVersion=4.0.0,safCompType=Comp_2nApp_2n_1_1'
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafimmnd[4087]: NO Implementer (applier)
connected: 470 (@safAmfService2020f) <10, 2020f>
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafamfnd[4152]: Started
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafimmnd[4087]: NO Implementer (applier)
connected: 471 (@OpenSafImmReplicatorB) <4, 2020f>
Mar 27 14:56:23 SLES-64BIT-SLOT2 osafntfimcnd[4116]: NO Started
Mar 27 14:56:26 SLES-64BIT-SLOT2 osafamfd[4142]: NO Cold sync complete!
Mar 27 14:56:28 SLES-64BIT-SLOT2 osafimmnd[4087]: NO PBE-OI established on
other SC. Dumping incrementally to file imm.db
......
......
Mar 27 15:28:23 SLES-64BIT-SLOT2 opensafd[4026]: ER Timed-out for response from
AMFND
Mar 27 15:28:23 SLES-64BIT-SLOT2 opensafd[4026]: ER
Mar 27 15:28:23 SLES-64BIT-SLOT2 opensafd[4026]: ER Going for recovery
Mar 27 15:28:23 SLES-64BIT-SLOT2 osafamfd[4142]: exiting for shutdown
Mar 27 15:28:23 SLES-64BIT-SLOT2 osafamfnd[4152]: ER AMF director unexpectedly
crashed
Mar 27 15:28:23 SLES-64BIT-SLOT2 osafamfnd[4152]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest)
received, OwnNodeId = 131599, SupervisionTime = 60
Mar 27 15:28:23 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node;
timeout=60
Mar 27 15:28:23 SLES-64BIT-SLOT2 osafimmnd[4087]: NO Implementer locally
disconnected. Marking it as doomed 470 <10, 2020f> (@safAmfService2020f)
Mar 27 15:28:23 SLES-64BIT-SLOT2 osafimmnd[4087]: NO Implementer disconnected
470 <10, 2020f> (@safAmfService2020f)
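For reference, the intervals between the key SLOT2 events can be read off the timestamps above. The snippet below is only a small sketch for doing that arithmetic; the helper name `elapsed_seconds` is made up for illustration, and the year 2014 is taken from the ticket creation date because syslog lines carry no year.

```python
from datetime import datetime

def elapsed_seconds(start, end, year="2014"):
    """Seconds between two syslog timestamps, assuming both fall in `year`."""
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{year} {start}", fmt)
    t1 = datetime.strptime(f"{year} {end}", fmt)
    return (t1 - t0).total_seconds()

# Timestamps copied from the SLOT2 (SC-2) syslog excerpts.
# "RDE role set to QUIESCED" -> "AMF director heart beat timeout"
quiesced_to_hb_timeout = elapsed_seconds("Mar 27 12:56:28", "Mar 27 12:57:24")
# "osafamfd ... Started" -> "Timed-out for response from AMFND"
restart_to_opensafd_timeout = elapsed_seconds("Mar 27 14:56:23", "Mar 27 15:28:23")

print(quiesced_to_hb_timeout, restart_to_opensafd_timeout)
```

So the heart beat timeout fires a little under a minute after the role change, and after the restart opensafd gives up waiting for AMFND 32 minutes after amfd was started.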
The node joins the cluster only after the cluster is reset.
Traces of amfd and amfnd are attached.
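A note for reading the logs: the reboot messages print node ids in decimal (`NodeId = 131599`), while FM, CLM and IMMND print them in hex (`2020f`). The following sketch, with a helper name chosen only for illustration, shows that these refer to the same node. The hex form of SC-1's id does not appear in the excerpts; it is computed here from `OwnNodeId = 131343`.

```python
def node_id_hex(node_id):
    """Render a decimal node id the way the FM/IMMND log lines print it."""
    return format(node_id, "x")

sc2 = node_id_hex(131599)  # NodeId in the "Rebooting OpenSAF" lines (SC-2)
sc1 = node_id_hex(131343)  # OwnNodeId in the SLOT1 osaffmd line (SC-1)

print(sc2, sc1)  # 2020f 2010f
```

This confirms that the "Node Down event for node id 2020f" on SLOT1 and the "Rebooting OpenSAF NodeId = 131599" lines all concern SC-2.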