1) On slot 2, NTFD became active in the RDA callback:
Sep 18 12:28:01.498302 osafntfd [2360:ntfs_main.c:0127] << rda_cb
Sep 18 12:28:01.501475 osafntfd [2360:ntfs_evt.c:0162] >> proc_rda_cb_msg
Sep 18 12:28:01.501531 osafntfd [2360:ntfs_evt.c:0166] NO ACTIVE request
Sep 18 12:28:01.515359 osafntfd [2360:lga_mds.c:0481] T2 LGA Rcvd MDS subscribe
evt from svc 20
Sep 18 12:28:01.515383 osafntfd [2360:lga_mds.c:0504] T2 MSG from LGS
NCSMDS_NEW_ACTIVE/UP
Sep 18 12:28:01.521284 osafntfd [2360:ntfs_mbcsv.c:0180] >>
ntfs_mbcsv_change_HA_state
Sep 18 12:28:01.521303 osafntfd [2360:mbcsv_api.c:0662] >>
mbcsv_process_chg_role_request: Change HA role for the checkpoint
Sep 18 12:28:01.521311 osafntfd [2360:mbcsv_api.c:0685] TR svc_id:44,
pwe_hdl:65550
Sep 18 12:28:01.521326 osafntfd [2360:mbcsv_api.c:0743] <<
mbcsv_process_chg_role_request: retval: 1
Sep 18 12:28:01.521332 osafntfd [2360:ntfs_mbcsv.c:0194] <<
ntfs_mbcsv_change_HA_state
Sep 18 12:28:01.521337 osafntfd [2360:NtfAdmin.cc:0693] >> checkNotificationList
Sep 18 12:28:01.521343 osafntfd [2360:NtfAdmin.cc:0726] << checkNotificationList
Sep 18 12:28:01.521348 osafntfd [2360:ntfs_evt.c:0189] << proc_rda_cb_msg
2) The problem occurred while processing the CSI set callback, in which NTFD
tries to terminate the IMCN process.
NTFD got stuck:
Sep 18 12:28:01.567197 osafntfd [2360:ava_hdl.c:0648] TR CSISet: ActiveCompName
= , StandbyRank = 0
Sep 18 12:28:01.567202 osafntfd [2360:ava_hdl.c:0650] TR Invoking component's
saAmfCSISetCallback: InvocationId = ffa00002, component name =
safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF
Sep 18 12:28:01.567208 osafntfd [2360:ntfs_amf.c:0174] >> amf_csi_set_callback
Sep 18 12:28:01.567223 osafntfd [2360:ntfs_amf.c:0042] >>
amf_active_state_handler: HA ACTIVE request
Sep 18 12:28:01.567228 osafntfd [2360:ntfs_amf.c:0046] <<
amf_active_state_handler
Sep 18 12:28:01.567234 osafntfd [2360:ava_api.c:1836] >> saAmfResponse:
SaAmfHandleT passed is ff000001
Sep 18 12:28:01.567239 osafntfd [2360:ava_hdl.c:0852] >> ava_hdl_pend_resp_get
Sep 18 12:28:01.567244 osafntfd [2360:ava_hdl.c:0868] << ava_hdl_pend_resp_get
Sep 18 12:28:01.567249 osafntfd [2360:ava_mds.c:0339] >> ava_mds_send
Sep 18 12:28:01.567254 osafntfd [2360:ava_mds.c:0690] >> ava_mds_msg_async_send
Sep 18 12:28:01.567263 osafntfd [2360:ava_mds.c:0179] >> ava_mds_cbk
Sep 18 12:28:01.567268 osafntfd [2360:ava_mds.c:0493] >> ava_mds_flat_enc
Sep 18 12:28:01.567274 osafntfd [2360:ava_mds.c:0510] << ava_mds_flat_enc:
retval = 1
Sep 18 12:28:01.567288 osafntfd [2360:ava_mds.c:0242] TR MDS flat encode
callback success
Sep 18 12:28:01.567293 osafntfd [2360:ava_mds.c:0316] << ava_mds_cbk
Sep 18 12:28:01.578124 osafntfd [2360:ava_mds.c:0715] <<
ava_mds_msg_async_send: retval = 1
Sep 18 12:28:01.578135 osafntfd [2360:ava_mds.c:0367] TR AVA MDS send success
Sep 18 12:28:01.578141 osafntfd [2360:ava_mds.c:0369] << ava_mds_send
Sep 18 12:28:01.578147 osafntfd [2360:ava_api.c:1918] TR Callback resonse
completed
Sep 18 12:28:01.578160 osafntfd [2360:ava_api.c:1934] << saAmfResponse: rc:1
Sep 18 12:28:01.578170 osafntfd [2360:ntfs_imcnutil.c:0180] TR
handle_state_ntfimcn: Terminating osafntfimcnd process
More traces are in 1110_full.tgz, but traces of IMCN are not available.
Attachment: 1110_full.tgz (28.8 MB; application/x-compressed)
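
To locate where a daemon stopped in a trace like the one above, one option is to pair each `>>` (function entry) record with its `<<` (exit) record and list the entries that never returned. The following is only a minimal sketch: the helper names `unwrap` and `unmatched_entries` are hypothetical, and the record layout (timestamp, process name, `[pid:file:line]`, body) is assumed from the excerpt in this ticket rather than taken from any OpenSAF tool.

```python
import re

# Assumed record layout, taken from the excerpt in this ticket, e.g.
# "Sep 18 12:28:01.567208 osafntfd [2360:ntfs_amf.c:0174] >> amf_csi_set_callback"
RECORD = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d{2}:\d{2}:\d{2}\.\d+) (?P<proc>\S+) "
    r"\[(?P<pid>\d+):(?P<file>[^:\]]+):(?P<line>\d+)\] (?P<body>.*)$"
)


def unwrap(lines):
    """Join mail-wrapped continuation lines onto the record they follow."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if RECORD.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line.strip()
    return records


def unmatched_entries(lines):
    """Return (timestamp, function) for each '>>' record with no later '<<'."""
    open_calls = []
    for rec in unwrap(lines):
        m = RECORD.match(rec)
        if not m:
            continue
        body = m.group("body")
        if body[:2] not in (">>", "<<"):
            continue  # TR/NO/T2 records carry no entry/exit information
        words = body[2:].strip().split(":")[0].split()
        if not words:
            continue
        name = words[0]
        if body.startswith(">>"):
            open_calls.append((m.group("ts"), name))
        else:
            # Close the most recent open call with the same function name.
            for i in range(len(open_calls) - 1, -1, -1):
                if open_calls[i][1] == name:
                    del open_calls[i]
                    break
    return open_calls


sample = [
    "Sep 18 12:28:01.567208 osafntfd [2360:ntfs_amf.c:0174] >> amf_csi_set_callback",
    "Sep 18 12:28:01.567223 osafntfd [2360:ntfs_amf.c:0042] >>",
    "amf_active_state_handler: HA ACTIVE request",
    "Sep 18 12:28:01.567228 osafntfd [2360:ntfs_amf.c:0046] <<",
    "amf_active_state_handler",
    "Sep 18 12:28:01.578170 osafntfd [2360:ntfs_imcnutil.c:0180] TR",
    "handle_state_ntfimcn: Terminating osafntfimcnd process",
]
for ts, func in unmatched_entries(sample):
    print(ts, func)
```

On the lines quoted in this ticket, the only entry left open is `amf_csi_set_callback`, which matches the observation that NTFD did not return from the CSI set callback after starting to terminate osafntfimcnd.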
---
**[tickets:#1110] NTF healthcheck callback timedout leading to node reboot**
**Status:** unassigned
**Milestone:** 4.3.3
**Created:** Thu Sep 18, 2014 07:41 AM UTC by Sirisha Alla
**Last Updated:** Thu Sep 18, 2014 11:25 AM UTC
**Owner:** Praveen
This issue is a continuation of ticket #1109.
During failover, the node that was rebooted failed to come up because of #1109.
Just then, an NTF health check callback timeout occurred on the then-active
controller, leading to a cluster reset.
Syslog of SC-2:
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafamfd[2391]: NO FAILOVER StandBy --> Active
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmd[2327]: NO ellect_coord invoke from
rda_callback ACTIVE
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmd[2327]: NO New coord elected, resides
at 2020f
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmnd[2337]: NO This IMMND is now the NEW
Coord
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmnd[2337]: NO PBE writing when new coord
elected => force PBE to regenerate db file
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmnd[2337]: NO STARTING PBE process.
.....
Sep 18 12:28:11 SLES-64BIT-SLOT2 osafamfnd[2401]: NO Assigned
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Sep 18 12:28:21 SLES-64BIT-SLOT2 osafamfd[2391]: ER
sendStateChangeNotificationAvd: saNtfNotificationSend Failed (5)
Sep 18 12:28:31 SLES-64BIT-SLOT2 kernel: [ 111.656926] TIPC: Established link
<1.1.2:eth0-1.1.1:eth0> on network plane A
Sep 18 12:28:32 SLES-64BIT-SLOT2 osafimmd[2327]: NO New IMMND process is on
STANDBY Controller at 2010f
Sep 18 12:28:32 SLES-64BIT-SLOT2 osafimmd[2327]: NO Extended intro from node
2010f
.......
SC-1 was rebooted because of #1109:
Sep 18 12:29:40 SLES-64BIT-SLOT2 osaffmd[2317]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId =
131599, SupervisionTime = 60
Sep 18 12:29:40 SLES-64BIT-SLOT2 kernel: [ 180.896027] TIPC: Resetting link
<1.1.2:eth0-1.1.1:eth0>, peer not responding
Sep 18 12:29:40 SLES-64BIT-SLOT2 kernel: [ 180.896032] TIPC: Lost link
<1.1.2:eth0-1.1.1:eth0> on network plane A
Sep 18 12:29:40 SLES-64BIT-SLOT2 kernel: [ 180.896034] TIPC: Lost contact with
<1.1.1>
Sep 18 12:29:40 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the
absence of PLM is outside the scope of OpenSAF
Health check callback timed out on NTF:
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO SU failover probation
timer started (timeout: 1200000000000 ns)
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO Performing failover of
'safSu=SC-2,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO
'safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF' recovery action escalated from
'componentFailover' to 'suFailover'
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO
'safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: ER
safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due
to:healthCheckcallbackTimeout Recovery is:suFailover
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: Component faulted: recovery is node failfast,
OwnNodeId = 131599, SupervisionTime = 60
Sep 18 12:33:54 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node;
timeout=60
Sep 18 12:34:17 SLES-64BIT-SLOT2 syslog-ng[1139]: syslog-ng starting up;
version='2.0.9'
Sep 18 12:34:18 SLES-64BIT-SLOT2 ifup: lo
Syslog and MDS logs for both controllers are attached. NTFD traces on SC-2 are
attached.
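
For reference, the interval between the last NTFD trace record quoted in this ticket (12:28:01) and the fault reported by osafamfnd (12:33:54) can be computed with a few lines of Python. This is only a sketch: the helper name `syslog_gap_seconds` is hypothetical, and the year 2014 is taken from the ticket creation date because syslog timestamps carry no year. The result shows how long NTFD was silent before the fault was reported; it does not show when the health check itself was issued.

```python
from datetime import datetime


def syslog_gap_seconds(start, end, year=2014):
    """Seconds between two syslog-style timestamps in the same (assumed) year."""
    fmt = "%Y %b %d %H:%M:%S"
    t0 = datetime.strptime(f"{year} {start}", fmt)
    t1 = datetime.strptime(f"{year} {end}", fmt)
    return (t1 - t0).total_seconds()


# Last NTFD trace record vs. the healthCheckcallbackTimeout fault in syslog.
gap = syslog_gap_seconds("Sep 18 12:28:01", "Sep 18 12:33:54")
print(f"{gap:.0f} s")
```

With the two timestamps above, the gap is 353 seconds, just under six minutes.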
---