1) On slot 2, NTFD became active in the RDA callback:

Sep 18 12:28:01.498302 osafntfd [2360:ntfs_main.c:0127] << rda_cb
Sep 18 12:28:01.501475 osafntfd [2360:ntfs_evt.c:0162] >> proc_rda_cb_msg
Sep 18 12:28:01.501531 osafntfd [2360:ntfs_evt.c:0166] NO ACTIVE request
Sep 18 12:28:01.515359 osafntfd [2360:lga_mds.c:0481] T2 LGA Rcvd MDS subscribe 
evt from svc 20
Sep 18 12:28:01.515383 osafntfd [2360:lga_mds.c:0504] T2 MSG from LGS 
NCSMDS_NEW_ACTIVE/UP
Sep 18 12:28:01.521284 osafntfd [2360:ntfs_mbcsv.c:0180] >> 
ntfs_mbcsv_change_HA_state
Sep 18 12:28:01.521303 osafntfd [2360:mbcsv_api.c:0662] >> 
mbcsv_process_chg_role_request: Change HA role for the checkpoint
Sep 18 12:28:01.521311 osafntfd [2360:mbcsv_api.c:0685] TR svc_id:44, 
pwe_hdl:65550
Sep 18 12:28:01.521326 osafntfd [2360:mbcsv_api.c:0743] << 
mbcsv_process_chg_role_request: retval: 1
Sep 18 12:28:01.521332 osafntfd [2360:ntfs_mbcsv.c:0194] << 
ntfs_mbcsv_change_HA_state
Sep 18 12:28:01.521337 osafntfd [2360:NtfAdmin.cc:0693] >> checkNotificationList
Sep 18 12:28:01.521343 osafntfd [2360:NtfAdmin.cc:0726] << checkNotificationList
Sep 18 12:28:01.521348 osafntfd [2360:ntfs_evt.c:0189] << proc_rda_cb_msg

2) The problem occurred while processing the CSI set callback, in which NTFD tries to 
terminate the IMCN process.
   NTFD got stuck:

Sep 18 12:28:01.567197 osafntfd [2360:ava_hdl.c:0648] TR CSISet: ActiveCompName 
= , StandbyRank = 0
Sep 18 12:28:01.567202 osafntfd [2360:ava_hdl.c:0650] TR Invoking component's 
saAmfCSISetCallback: InvocationId = ffa00002, component name = 
safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF
Sep 18 12:28:01.567208 osafntfd [2360:ntfs_amf.c:0174] >> amf_csi_set_callback
Sep 18 12:28:01.567223 osafntfd [2360:ntfs_amf.c:0042] >> 
amf_active_state_handler: HA ACTIVE request
Sep 18 12:28:01.567228 osafntfd [2360:ntfs_amf.c:0046] << 
amf_active_state_handler
Sep 18 12:28:01.567234 osafntfd [2360:ava_api.c:1836] >> saAmfResponse: 
SaAmfHandleT passed is ff000001
Sep 18 12:28:01.567239 osafntfd [2360:ava_hdl.c:0852] >> ava_hdl_pend_resp_get
Sep 18 12:28:01.567244 osafntfd [2360:ava_hdl.c:0868] << ava_hdl_pend_resp_get
Sep 18 12:28:01.567249 osafntfd [2360:ava_mds.c:0339] >> ava_mds_send
Sep 18 12:28:01.567254 osafntfd [2360:ava_mds.c:0690] >> ava_mds_msg_async_send
Sep 18 12:28:01.567263 osafntfd [2360:ava_mds.c:0179] >> ava_mds_cbk
Sep 18 12:28:01.567268 osafntfd [2360:ava_mds.c:0493] >> ava_mds_flat_enc
Sep 18 12:28:01.567274 osafntfd [2360:ava_mds.c:0510] << ava_mds_flat_enc: 
retval = 1
Sep 18 12:28:01.567288 osafntfd [2360:ava_mds.c:0242] TR MDS flat encode 
callback success
Sep 18 12:28:01.567293 osafntfd [2360:ava_mds.c:0316] << ava_mds_cbk
Sep 18 12:28:01.578124 osafntfd [2360:ava_mds.c:0715] << 
ava_mds_msg_async_send: retval = 1
Sep 18 12:28:01.578135 osafntfd [2360:ava_mds.c:0367] TR AVA MDS send success
Sep 18 12:28:01.578141 osafntfd [2360:ava_mds.c:0369] << ava_mds_send
Sep 18 12:28:01.578147 osafntfd [2360:ava_api.c:1918] TR Callback resonse 
completed
Sep 18 12:28:01.578160 osafntfd [2360:ava_api.c:1934] << saAmfResponse: rc:1
Sep 18 12:28:01.578170 osafntfd [2360:ntfs_imcnutil.c:0180] TR 
handle_state_ntfimcn: Terminating osafntfimcnd process

More traces are in 1110_full.tgz, but IMCN traces are not available.
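
One way to confirm where NTFD stopped is to pair the `>>` (entry) and `<<` (exit) trace markers and list the functions that were entered but never left. The following is a minimal sketch, not an OpenSAF tool: the helper name `unfinished_calls` and the regular expression are assumptions for illustration, and it expects each trace entry on a single line (the excerpts above are wrapped by the mail client and would need to be rejoined first).

```python
import re

# Matches "[pid:file:line] >> func" or "[pid:file:line] << func".
# The pattern is an assumption based on the trace lines quoted above.
TRACE_RE = re.compile(r"\[\d+:[^\]]+\]\s+(>>|<<)\s+([A-Za-z_]\w*)")


def unfinished_calls(lines):
    """Return functions that logged an entry (>>) but no matching exit (<<)."""
    stack = []
    for line in lines:
        m = TRACE_RE.search(line)
        if not m:
            continue  # TR/NO/T2 lines carry no entry/exit marker
        marker, func = m.groups()
        if marker == ">>":
            stack.append(func)
        elif func in stack:
            # Unwind to the matching entry; exits with no entry are ignored.
            while stack.pop() != func:
                pass
    return stack


sample = [
    "Sep 18 12:28:01.567208 osafntfd [2360:ntfs_amf.c:0174] >> amf_csi_set_callback",
    "Sep 18 12:28:01.567223 osafntfd [2360:ntfs_amf.c:0042] >> amf_active_state_handler: HA ACTIVE request",
    "Sep 18 12:28:01.567228 osafntfd [2360:ntfs_amf.c:0046] << amf_active_state_handler",
    "Sep 18 12:28:01.578170 osafntfd [2360:ntfs_imcnutil.c:0180] TR handle_state_ntfimcn: Terminating osafntfimcnd process",
]
print(unfinished_calls(sample))  # ['amf_csi_set_callback']
```

Run against the excerpt above, the only open entry is `amf_csi_set_callback`, which matches the observation that the last trace is the attempt to terminate osafntfimcnd.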



Attachment: 1110_full.tgz (28.8 MB; application/x-compressed) 


---

** [tickets:#1110] NTF healthcheck callback timedout leading to node reboot**

**Status:** unassigned
**Milestone:** 4.3.3
**Created:** Thu Sep 18, 2014 07:41 AM UTC by Sirisha Alla
**Last Updated:** Thu Sep 18, 2014 11:25 AM UTC
**Owner:** Praveen

This issue is a continuation of ticket #1109.

During failover, the node that was rebooted failed to come up because of #1109. 
At that point the NTF health check callback timed out on the then-active 
controller, leading to a cluster reset.

Syslog of SC-2:

Sep 18 12:28:01 SLES-64BIT-SLOT2 osafamfd[2391]: NO FAILOVER StandBy --> Active
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmd[2327]: NO ellect_coord invoke from 
rda_callback ACTIVE
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmd[2327]: NO New coord elected, resides 
at 2020f
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmnd[2337]: NO This IMMND is now the NEW 
Coord
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmnd[2337]: NO PBE writing when new coord 
elected => force PBE to regenerate db file
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmnd[2337]: NO STARTING PBE process.
.....
Sep 18 12:28:11 SLES-64BIT-SLOT2 osafamfnd[2401]: NO Assigned 
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Sep 18 12:28:21 SLES-64BIT-SLOT2 osafamfd[2391]: ER 
sendStateChangeNotificationAvd: saNtfNotificationSend Failed (5)
Sep 18 12:28:31 SLES-64BIT-SLOT2 kernel: [  111.656926] TIPC: Established link 
<1.1.2:eth0-1.1.1:eth0> on network plane A
Sep 18 12:28:32 SLES-64BIT-SLOT2 osafimmd[2327]: NO New IMMND process is on 
STANDBY Controller at 2010f
Sep 18 12:28:32 SLES-64BIT-SLOT2 osafimmd[2327]: NO Extended intro from node 
2010f
.......

SC-1 rebooted because of #1109:

Sep 18 12:29:40 SLES-64BIT-SLOT2 osaffmd[2317]: Rebooting OpenSAF NodeId = 
131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 
131599, SupervisionTime = 60
Sep 18 12:29:40 SLES-64BIT-SLOT2 kernel: [  180.896027] TIPC: Resetting link 
<1.1.2:eth0-1.1.1:eth0>, peer not responding
Sep 18 12:29:40 SLES-64BIT-SLOT2 kernel: [  180.896032] TIPC: Lost link 
<1.1.2:eth0-1.1.1:eth0> on network plane A
Sep 18 12:29:40 SLES-64BIT-SLOT2 kernel: [  180.896034] TIPC: Lost contact with 
<1.1.1>
Sep 18 12:29:40 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the 
absence of PLM is outside the scope of OpenSAF
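
The syslog identifies nodes in two notations: hexadecimal in the IMM messages (2010f, 2020f) and decimal in the reboot messages (131343, 131599). A short sketch of the conversion, with a hypothetical helper name `node_id_hex`, shows that they refer to the same two nodes:

```python
def node_id_hex(node_id: int) -> str:
    """Format a decimal node ID the way the IMM log lines print it (hex, no prefix)."""
    return format(node_id, "x")


print(node_id_hex(131343))  # 2010f (the peer that was reported down)
print(node_id_hex(131599))  # 2020f (OwnNodeId in the reboot messages)
```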


The health check callback timed out on NTF:

Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO SU failover probation 
timer started (timeout: 1200000000000 ns)
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO Performing failover of 
'safSu=SC-2,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO 
'safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF' recovery action escalated from 
'componentFailover' to 'suFailover'
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO 
'safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: ER 
safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due 
to:healthCheckcallbackTimeout Recovery is:suFailover
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: Rebooting OpenSAF NodeId = 
131599 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131599, SupervisionTime = 60
Sep 18 12:33:54 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; 
timeout=60
Sep 18 12:34:17 SLES-64BIT-SLOT2 syslog-ng[1139]: syslog-ng starting up; 
version='2.0.9'
Sep 18 12:34:18 SLES-64BIT-SLOT2 ifup:     lo
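
To put the timestamps in perspective, the sketch below computes the time between the failover messages at 12:28:01 and the health check fault at 12:33:54, and converts the SU failover probation timeout from nanoseconds to seconds. The helper names are hypothetical, and the year 2014 is taken from the ticket's creation date because syslog timestamps carry no year.

```python
from datetime import datetime

FMT = "%Y %b %d %H:%M:%S"


def seconds_between(start: str, end: str, year: int = 2014) -> float:
    """Elapsed seconds between two syslog timestamps such as 'Sep 18 12:28:01'."""
    t0 = datetime.strptime(f"{year} {start}", FMT)
    t1 = datetime.strptime(f"{year} {end}", FMT)
    return (t1 - t0).total_seconds()


def ns_to_seconds(ns: int) -> int:
    """Convert a nanosecond timeout, as printed by osafamfnd, to whole seconds."""
    return ns // 1_000_000_000


print(seconds_between("Sep 18 12:28:01", "Sep 18 12:33:54"))  # 353.0
print(ns_to_seconds(1200000000000))  # 1200
```

So roughly six minutes passed between the failover and the reported fault, and the probation timer is 1200 seconds (20 minutes).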

Syslog and MDS logs for both controllers are attached. NTFD traces on SC-2 
are attached.


