---
** [tickets:#708] Standby Controller failed to come up when failover happened
in the middle of switchover**
**Status:** unassigned
**Created:** Fri Jan 03, 2014 03:48 PM UTC by Sirisha Alla
**Last Updated:** Fri Jan 03, 2014 03:48 PM UTC
**Owner:** nobody
The issue is observed on changeset 4733 + #220 patches corresponding to cs 4741
and cs 4742. The test setup is a 4 node SLES 64bit VMs.The setup is single PBE
enabled loaded with 25k objects.
Following steps are done when the issue is observed.
1) Trigger middleware switchover. Make sure the coordinator IMMND is on the
Active controller.
2) Reboot standby controller once the Active moves to quiesced.
Syslog on the controllers: SLOT2(SC-2) was active at the beginning of the test
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: NO Peer FM down on node_id:
131343
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: NO Role: STANDBY, Node Down for
node id: 2010f
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: NO Failover occurred in the
middle of switchover
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId =
131599, SupervisionTime = 60
Jan 3 15:38:56 SLES-64BIT-SLOT2 kernel: [ 614.532113] TIPC: Resetting link
<1.1.2:eth0-1.1.1:eth0>, peer not responding
Jan 3 15:38:56 SLES-64BIT-SLOT2 kernel: [ 614.532122] TIPC: Lost link
<1.1.2:eth0-1.1.1:eth0> on network plane A
Jan 3 15:38:56 SLES-64BIT-SLOT2 kernel: [ 614.532130] TIPC: Lost contact with
<1.1.1>
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Global discard node
received for nodeId:2010f pid:2412
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 8
<0, 2010f(down)> (safAmfService)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected
26 <0, 2010f(down)> (@safLogService)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected
27 <0, 2010f(down)> (@OpenSafImmReplicatorB)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 9
<0, 2010f(down)> (MsgQueueService131343)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmd[2343]: WA IMMD lost contact with peer
IMMD (NCSMDS_RED_DOWN)
Jan 3 15:38:56 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the
absence of PLM is outside the scope of OpenSAF
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafrded[2324]: NO rde_rde_set_role: role set
to 1
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafamfd[2428]: NO FAILOVER StandBy --> Active
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: NO Role: ACTIVE, Node Down for
node id: 2010f
Jan 3 15:38:56 SLES-64BIT-SLOT2 osaffmd[2333]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId =
131599, SupervisionTime = 60
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer disconnected 6
<18, 2020f> (@safAmfService2020f)
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafimmnd[2353]: NO Implementer connected: 28
(safAmfService) <18, 2020f>
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafamfd[2428]: NO Node 'SC-1' left the cluster
Jan 3 15:38:56 SLES-64BIT-SLOT2 osafamfd[2428]: NO FAILOVER StandBy --> Active
DONE!
(SLOT1)2010f did not come up after reboot. The node was rebooted again and
following are the logs.
Jan 3 16:11:44 SLES-64BIT-SLOT1 osafrded[2849]: NO rde@2020f has active state
=> Standby role
Jan 3 16:11:44 SLES-64BIT-SLOT1 osaffmd[2859]: Started
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmd[2869]: Started
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: Started
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Persistent Back-End
capability configured, Pbe file:imm.db (suffix may get added)
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE:
IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE:
IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE:
IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO NODE STATE->
IMM_NODE_ISOLATED
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: Ruling epoch noted
as:23
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmd[2869]: NO IMMND coord at 2020f
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY:
SaImmRepositoryInitModeT changed and noted as 'SA_IMM_KEEP_REPOSITORY'
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO NODE STATE->
IMM_NODE_W_AVAILABLE
Jan 3 16:11:45 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE:
IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: New Epoch for IMMND
process at node 2020f old epoch: 22 new epoch:23
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO IMMND coord at 2020f
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: New Epoch for IMMND
process at node 2030f old epoch: 22 new epoch:23
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: New Epoch for IMMND
process at node 2040f old epoch: 22 new epoch:23
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO NODE STATE->
IMM_NODE_FULLY_AVAILABLE 2261
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO RepositoryInitModeT is
SA_IMM_KEEP_REPOSITORY
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Epoch set to 23 in ImmModel
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmd[2869]: NO SBY: New Epoch for IMMND
process at node 2010f old epoch: 0 new epoch:23
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO SERVER STATE:
IMM_SERVER_SYNC_CLIENT --> IMM SERVER READY
Jan 3 16:12:00 SLES-64BIT-SLOT1 osaflogd[2892]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osaflogd[2892]: NO log root directory is:
/var/log/opensaf/saflog
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Implementer (applier)
connected: 34 (@safLogService) <3, 2010f>
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafntfd[2905]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafclmd[2919]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafclmna[2929]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafclmna[2929]: NO
safNode=SC-1,safCluster=myClmCluster Joined cluster, nodeid=2010f
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafamfd[2938]: Started
Jan 3 16:12:00 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Implementer (applier)
connected: 35 (@safAmfService2010f) <10, 2010f>
Jan 3 16:12:01 SLES-64BIT-SLOT1 osafimmnd[2879]: NO Implementer (applier)
connected: 36 (@OpenSafImmReplicatorB) <4, 2010f>
Jan 3 16:12:01 SLES-64BIT-SLOT1 osafntfimcnd[2913]: NO Started
Jan 3 16:12:01 SLES-64BIT-SLOT1 osafamfnd[2948]: Started
Jan 3 16:12:03 SLES-64BIT-SLOT1 osafamfd[2938]: NO Cold sync complete!
Jan 3 16:12:06 SLES-64BIT-SLOT1 osafimmnd[2879]: NO PBE-OI established on
other SC. Dumping incrementally to file imm.db
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: renewing lease of
192.168.56.101
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: leased 192.168.56.101 for
3600 seconds
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: no renewal time supplied,
assuming 1800 seconds
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: no rebind time supplied,
assuming 3150 seconds
Jan 3 16:18:05 SLES-64BIT-SLOT1 dhcpcd[2158]: eth0: adding IP address
192.168.56.101/24
Jan 3 16:18:05 SLES-64BIT-SLOT1 ifup: eth0 device: Intel Corporation
82540EM Gigabit Ethernet Controller (rev 02)
Jan 3 16:18:05 SLES-64BIT-SLOT1 SuSEfirewall2: SuSEfirewall2 not active
Jan 3 16:38:05 SLES-64BIT-SLOT1 -- MARK --
Jan 3 16:44:01 SLES-64BIT-SLOT1 opensafd[2818]: ER Timed-out for response from
AMFND
Jan 3 16:44:01 SLES-64BIT-SLOT1 opensafd[2818]: ER
Jan 3 16:44:01 SLES-64BIT-SLOT1 opensafd[2818]: ER Going for recovery
Jan 3 16:44:01 SLES-64BIT-SLOT1 osafamfd[2938]: exiting on signal 15
Jan 3 16:44:01 SLES-64BIT-SLOT1 osafamfnd[2948]: ER AMF director unexpectedly
crashed
Jan 3 16:44:01 SLES-64BIT-SLOT1 osafamfnd[2948]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: local AVD down(Adest) or both AVD down(Vdest)
received, OwnNodeId = 131343, SupervisionTime = 60
Following message is observed in AMFD traces on SLOT2:
Jan 3 16:43:54.288337 osafamfd [2428:ndfsm.cc:0058] TR invalid node ID (2010f)
Jan 3 16:43:54.288354 osafamfd [2428:ndfsm.cc:0162] << avd_node_up_evh
Jan 3 16:43:54.288369 osafamfd [2428:mbcsv_api.c:0773] >>
mbcsv_process_snd_ckpt_request: Sending checkpoint data to all STANDBY peers,
as per the send-type specified
Jan 3 16:43:54.288386 osafamfd [2428:mbcsv_api.c:0803] TR svc_id:10,
pwe_hdl:65537
Jan 3 16:43:54.288401 osafamfd [2428:mbcsv_util.c:0343] >>
mbcsv_send_ckpt_data_to_all_peers
Jan 3 16:43:54.288456 osafamfd [2428:mbcsv_util.c:0387] TR dispatching FSM for
NCSMBCSV_SEND_ASYNC_UPDATE
Jan 3 16:43:54.288486 osafamfd [2428:mbcsv_act.c:0101] TR ASYNC update to be
sent. role: 1, svc_id: 10, pwe_hdl: 65537
Jan 3 16:43:54.288503 osafamfd [2428:mbcsv_util.c:0399] TR calling encode
callback
Jan 3 16:43:54.288520 osafamfd [2428:mbcsv_util.c:0438] TR send the encoded
message to any other peer with same s/w version
Jan 3 16:43:54.288766 osafamfd [2428:mbcsv_util.c:0441] TR dispatching FSM for
NCSMBCSV_SEND_ASYNC_UPDATE
Jan 3 16:43:54.288783 osafamfd [2428:mbcsv_act.c:0101] TR ASYNC update to be
sent. role: 1, svc_id: 10, pwe_hdl: 65537
Jan 3 16:43:54.288798 osafamfd [2428:mbcsv_mds.c:0185] >> mbcsv_mds_send_msg:
sending to vdest:1
Jan 3 16:43:54.288813 osafamfd [2428:mbcsv_mds.c:0209] TR send type
MDS_SENDTYPE_REDRSP:
Jan 3 16:43:54.289876 osafamfd [2428:mbcsv_mds.c:0244] << mbcsv_mds_send_msg:
success
Jan 3 16:43:54.289923 osafamfd [2428:mbcsv_util.c:0492] <<
mbcsv_send_ckpt_data_to_all_peers
Jan 3 16:43:54.289941 osafamfd [2428:mbcsv_api.c:0868] <<
mbcsv_process_snd_ckpt_request: retval: 1
Jan 3 16:43:54.289959 osafamfd [2428:imm.cc:0235] >> peek
Jan 3 16:43:54.289973 osafamfd [2428:imm.cc:0244] << peek
Jan 3 16:43:55.383747 osafamfd [2428:ndmsg.cc:0345] >> avd_n2d_msg_rcv
Jan 3 16:43:55.383915 osafamfd [2428:ndfsm.cc:0054] >> avd_node_up_evh: from
2010f
Jan 3 16:43:55.383945 osafamfd [2428:ndfsm.cc:0058] TR invalid node ID (2010f)
---
Sent from sourceforge.net because [email protected] is
subscribed to https://sourceforge.net/p/opensaf/tickets/
To unsubscribe from further messages, a project admin can change settings at
https://sourceforge.net/p/opensaf/admin/tickets/options. Or, if this is a
mailing list, you can unsubscribe from the mailing list.
------------------------------------------------------------------------------
Rapidly troubleshoot problems before they affect your business. Most IT
organizations don't have a clear picture of how application performance
affects their revenue. With AppDynamics, you get 100% visibility into your
Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics Pro!
http://pubads.g.doubleclick.net/gampad/clk?id=84349831&iu=/4140/ostg.clktrk
_______________________________________________
Opensaf-tickets mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets