An analysis and some more information:

- Initially SC-1 is active and SC-2 is standby.
- SC-1 is powered off, so SC-2 performs a failover from standby to active.
- At this point the NTF service on SC-2 tries to terminate its child process
  osafntfimcnd, but the termination never completes, so the NTF service on
  SC-2 hangs.
- A few minutes later the NTF healthcheck callback times out, causing SC-2 to
  be rebooted (see the sketch after this list).
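
If the hang is an unbounded wait for osafntfimcnd to exit, one way to keep the
NTF main thread responsive is to bound the wait and escalate to SIGKILL. Below
is a minimal sketch, assuming the child pid is known; the function name and
timeout are illustrative assumptions, not the actual ntfs_imcnutil.c code:

    /* Hypothetical sketch (not the actual ntfs_imcnutil.c code): terminate
     * the child with a bounded wait so the caller cannot block forever.
     * The function name, pid handling and timeout are assumptions. */
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int terminate_child_bounded(pid_t pid, int timeout_sec)
    {
            int polls = timeout_sec * 10;

            if (kill(pid, SIGTERM) == -1)
                    return -1;          /* child already gone or bad pid */

            /* Poll with WNOHANG instead of a blocking waitpid(). */
            while (polls-- > 0) {
                    pid_t rc = waitpid(pid, NULL, WNOHANG);
                    if (rc == pid)
                            return 0;   /* child has exited and is reaped */
                    if (rc == -1)
                            return -1;  /* waitpid() failed */
                    usleep(100000);     /* 100 ms between polls */
            }

            /* Child did not exit in time: force it, then reap synchronously. */
            kill(pid, SIGKILL);
            waitpid(pid, NULL, 0);
            return 0;
    }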

On SC-1:
Sep 12 12:53:56: Power off active SC-1

Syslog SC-2:
Sep 12 12:53:56 SC-2 osafamfd[10248]: NO FAILOVER StandBy --> Active
Sep 12 12:53:56 SC-2 osafimmd[10159]: NO ellect_coord invoke from rda_callback 
ACTIVE
Sep 12 12:53:56 SC-2 osafimmd[10159]: NO New coord elected, resides at 2020f
Sep 12 12:53:56 SC-2 osafimmnd[10169]: NO This IMMND is now the NEW Coord
....
....
Sep 12 12:58:45 SC-2 osafamfnd[10261]: NO Performing failover of 
'safSu=SC-2,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Sep 12 12:58:45 SC-2 osafamfnd[10261]: NO 
'safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF' recovery action escalated from 
'componentFailover' to 'suFailover'
Sep 12 12:58:45 SC-2 osafamfnd[10261]: NO 
'safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Sep 12 12:58:45 SC-2 osafamfnd[10261]: ER 
safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due 
to:healthCheckcallbackTimeout Recovery is:suFailover
Sep 12 12:58:45 SC-2 osafamfnd[10261]: Rebooting OpenSAF NodeId = 131599 EE 
Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId = 
131599, SupervisionTime = 60
Sep 12 12:58:45 SC-2 opensaf_reboot: Rebooting local node; timeout=60
Sep 12 12:59:15 SC-2 syslog-ng[3095]: syslog-ng starting up; version='2.0.9'


NTF trace SC-2:
Sep 12 12:53:57.072635 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_hdl.c:0630] 
TR CSISet: CSIName = , CSIFlags = 4, HA state = 1
Sep 12 12:53:57.072641 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_hdl.c:0645] 
TR CSISet: Active Transition Descriptor = 1, Active Component Name = 
Sep 12 12:53:57.072646 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_hdl.c:0648] 
TR CSISet: ActiveCompName = , StandbyRank = 0
Sep 12 12:53:57.072652 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_hdl.c:0650] 
TR Invoking component's saAmfCSISetCallback: InvocationId = f8e00001, component 
name = safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF
Sep 12 12:53:57.072657 osafntfd 
[10192:../../../../../../../opensaf/osaf/services/saf/ntfsv/ntfs/ntfs_amf.c:0174]
 >> amf_csi_set_callback 
Sep 12 12:53:57.072663 osafntfd 
[10192:../../../../../../../opensaf/osaf/services/saf/ntfsv/ntfs/ntfs_amf.c:0042]
 >> amf_active_state_handler: HA ACTIVE request
Sep 12 12:53:57.072669 osafntfd 
[10192:../../../../../../../opensaf/osaf/services/saf/ntfsv/ntfs/ntfs_amf.c:0046]
 << amf_active_state_handler 
Sep 12 12:53:57.072685 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_api.c:1836] 
>> saAmfResponse: SaAmfHandleT passed is ff000001
Sep 12 12:53:57.072691 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_hdl.c:0852] 
>> ava_hdl_pend_resp_get 
Sep 12 12:53:57.072696 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_hdl.c:0868] 
<< ava_hdl_pend_resp_get 
Sep 12 12:53:57.072701 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0339] 
>> ava_mds_send 
Sep 12 12:53:57.072706 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0690] 
>> ava_mds_msg_async_send 
Sep 12 12:53:57.072714 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0179] 
>> ava_mds_cbk 
Sep 12 12:53:57.072719 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0493] 
>> ava_mds_flat_enc 
Sep 12 12:53:57.072726 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0510] 
<< ava_mds_flat_enc: retval = 1
Sep 12 12:53:57.072731 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0242] 
TR MDS flat encode callback success
Sep 12 12:53:57.072736 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0316] 
<< ava_mds_cbk 
Sep 12 12:53:57.072754 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0715] 
<< ava_mds_msg_async_send: retval = 1
Sep 12 12:53:57.072761 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0367] 
TR AVA MDS send success
Sep 12 12:53:57.072766 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0369] 
<< ava_mds_send 
Sep 12 12:53:57.072785 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_api.c:1918] 
TR Callback resonse completed
Sep 12 12:53:57.072791 osafntfd 
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_api.c:1934] 
<< saAmfResponse: rc:1
Sep 12 12:53:57.072797 osafntfd 
[10192:../../../../../../../opensaf/osaf/services/saf/ntfsv/ntfs/ntfs_imcnutil.c:0180]
 TR handle_state_ntfimcn: Terminating osafntfimcnd process



---

** [tickets:#1110] NTF healthcheck callback timedout leading to node reboot**

**Status:** unassigned
**Milestone:** 4.3.3
**Created:** Thu Sep 18, 2014 07:41 AM UTC by Sirisha Alla
**Last Updated:** Fri Sep 19, 2014 10:34 AM UTC
**Owner:** Praveen

This issue is a continuation of ticket #1109.

During the failover, the node that went for reboot failed to come back up due
to #1109. Shortly afterwards, an NTF healthcheck callback timeout occurred on
the then-active controller, leading to a cluster reset (see the sketch below).
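
For context on the failure mode: AMF delivers the healthcheck callback through
the component's saAmfDispatch() loop, and the component must answer with
saAmfResponse() within the healthcheck callback timeout. A rough sketch of
that handshake (illustrative only, not the NTF server code) showing why a
blocked dispatch thread surfaces as 'healthCheckcallbackTimeout':

    /* Sketch of the AMF healthcheck handshake; the static handle and the
     * callback name are illustrative.  If the thread that runs
     * saAmfDispatch() is stuck (for example waiting on a child process),
     * this callback never runs, no saAmfResponse() is sent, and AMF applies
     * the configured recovery (componentFailover, escalated here to
     * suFailover and node failfast). */
    #include <saAmf.h>

    static SaAmfHandleT amf_hdl;    /* from saAmfInitialize() */

    static void healthcheck_cb(SaInvocationT invocation,
                               const SaNameT *comp_name,
                               SaAmfHealthcheckKeyT *key)
    {
            (void)comp_name;
            (void)key;
            /* Must run and answer within the healthcheck callback timeout. */
            saAmfResponse(amf_hdl, invocation, SA_AIS_OK);
    }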

Syslog of SC-2:

Sep 18 12:28:01 SLES-64BIT-SLOT2 osafamfd[2391]: NO FAILOVER StandBy --> Active
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmd[2327]: NO ellect_coord invoke from 
rda_callback ACTIVE
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmd[2327]: NO New coord elected, resides 
at 2020f
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmnd[2337]: NO This IMMND is now the NEW 
Coord
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmnd[2337]: NO PBE writing when new coord 
elected => force PBE to regenerate db file
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmnd[2337]: NO STARTING PBE process.
.....
Sep 18 12:28:11 SLES-64BIT-SLOT2 osafamfnd[2401]: NO Assigned 
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Sep 18 12:28:21 SLES-64BIT-SLOT2 osafamfd[2391]: ER 
sendStateChangeNotificationAvd: saNtfNotificationSend Failed (5)
Sep 18 12:28:31 SLES-64BIT-SLOT2 kernel: [  111.656926] TIPC: Established link 
<1.1.2:eth0-1.1.1:eth0> on network plane A
Sep 18 12:28:32 SLES-64BIT-SLOT2 osafimmd[2327]: NO New IMMND process is on 
STANDBY Controller at 2010f
Sep 18 12:28:32 SLES-64BIT-SLOT2 osafimmd[2327]: NO Extended intro from node 
2010f
.......

SC-1 went down for a reboot because of #1109:

Sep 18 12:29:40 SLES-64BIT-SLOT2 osaffmd[2317]: Rebooting OpenSAF NodeId = 
131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId = 
131599, SupervisionTime = 60
Sep 18 12:29:40 SLES-64BIT-SLOT2 kernel: [  180.896027] TIPC: Resetting link 
<1.1.2:eth0-1.1.1:eth0>, peer not responding
Sep 18 12:29:40 SLES-64BIT-SLOT2 kernel: [  180.896032] TIPC: Lost link 
<1.1.2:eth0-1.1.1:eth0> on network plane A
Sep 18 12:29:40 SLES-64BIT-SLOT2 kernel: [  180.896034] TIPC: Lost contact with 
<1.1.1>
Sep 18 12:29:40 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the 
absence of PLM is outside the scope of OpenSAF


The health check callback timed out on NTF:

Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO SU failover probation 
timer started (timeout: 1200000000000 ns)
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO Performing failover of 
'safSu=SC-2,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO 
'safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF' recovery action escalated from 
'componentFailover' to 'suFailover'
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO 
'safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to 
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: ER 
safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due 
to:healthCheckcallbackTimeout Recovery is:suFailover
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: Rebooting OpenSAF NodeId = 
131599 EE Name = , Reason: Component faulted: recovery is node failfast, 
OwnNodeId = 131599, SupervisionTime = 60
Sep 18 12:33:54 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node; 
timeout=60
Sep 18 12:34:17 SLES-64BIT-SLOT2 syslog-ng[1139]: syslog-ng starting up; 
version='2.0.9'
Sep 18 12:34:18 SLES-64BIT-SLOT2 ifup:     lo

Syslog and MDS logs for both controllers are attached, along with NTFD traces
from SC-2.


