Analysis and some more information:
- Initially SC-1 is active and SC-2 is standby.
- SC-1 is powered off, so SC-2 fails over from standby to active.
During the failover, the NTF service on SC-2 tries to terminate its child
process osafntfimcnd, but the termination never completes. As a result the
NTF service on SC-2 hangs, and a few minutes later the NTF healthcheck
callback times out, causing SC-2 to reboot.
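For illustration only (this is not the actual ntfs_imcnutil.c code, and the
helper name is made up): if the parent sends SIGTERM and then waits for the
child with a blocking waitpid(), and the child never exits, the parent blocks
indefinitely. The NTF main thread would then be unable to run its dispatch
loop, so the healthcheck callback can never be answered, which matches the
timeout seen in the logs below.

    /* Hypothetical sketch of a termination path that can hang the caller. */
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    static void terminate_child_blocking(pid_t child_pid)
    {
        kill(child_pid, SIGTERM);
        /* If the child ignores SIGTERM or is stuck in uninterruptible
         * sleep, this waitpid() never returns and the calling thread
         * (here: the NTF main thread) hangs. */
        waitpid(child_pid, NULL, 0);
    }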
On SC-1:
Sep 12 12:53:56: Power off active SC-1
Syslog SC-2:
Sep 12 12:53:56 SC-2 osafamfd[10248]: NO FAILOVER StandBy --> Active
Sep 12 12:53:56 SC-2 osafimmd[10159]: NO ellect_coord invoke from rda_callback
ACTIVE
Sep 12 12:53:56 SC-2 osafimmd[10159]: NO New coord elected, resides at 2020f
Sep 12 12:53:56 SC-2 osafimmnd[10169]: NO This IMMND is now the NEW Coord
....
....
Sep 12 12:58:45 SC-2 osafamfnd[10261]: NO Performing failover of
'safSu=SC-2,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Sep 12 12:58:45 SC-2 osafamfnd[10261]: NO
'safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF' recovery action escalated from
'componentFailover' to 'suFailover'
Sep 12 12:58:45 SC-2 osafamfnd[10261]: NO
'safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Sep 12 12:58:45 SC-2 osafamfnd[10261]: ER
safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due
to:healthCheckcallbackTimeout Recovery is:suFailover
Sep 12 12:58:45 SC-2 osafamfnd[10261]: Rebooting OpenSAF NodeId = 131599 EE
Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId =
131599, SupervisionTime = 60
Sep 12 12:58:45 SC-2 opensaf_reboot: Rebooting local node; timeout=60
Sep 12 12:59:15 SC-2 syslog-ng[3095]: syslog-ng starting up; version='2.0.9'
NTF trace SC-2:
Sep 12 12:53:57.072635 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_hdl.c:0630]
TR CSISet: CSIName = , CSIFlags = 4, HA state = 1
Sep 12 12:53:57.072641 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_hdl.c:0645]
TR CSISet: Active Transition Descriptor = 1, Active Component Name =
Sep 12 12:53:57.072646 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_hdl.c:0648]
TR CSISet: ActiveCompName = , StandbyRank = 0
Sep 12 12:53:57.072652 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_hdl.c:0650]
TR Invoking component's saAmfCSISetCallback: InvocationId = f8e00001, component
name = safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF
Sep 12 12:53:57.072657 osafntfd
[10192:../../../../../../../opensaf/osaf/services/saf/ntfsv/ntfs/ntfs_amf.c:0174]
>> amf_csi_set_callback
Sep 12 12:53:57.072663 osafntfd
[10192:../../../../../../../opensaf/osaf/services/saf/ntfsv/ntfs/ntfs_amf.c:0042]
>> amf_active_state_handler: HA ACTIVE request
Sep 12 12:53:57.072669 osafntfd
[10192:../../../../../../../opensaf/osaf/services/saf/ntfsv/ntfs/ntfs_amf.c:0046]
<< amf_active_state_handler
Sep 12 12:53:57.072685 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_api.c:1836]
>> saAmfResponse: SaAmfHandleT passed is ff000001
Sep 12 12:53:57.072691 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_hdl.c:0852]
>> ava_hdl_pend_resp_get
Sep 12 12:53:57.072696 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_hdl.c:0868]
<< ava_hdl_pend_resp_get
Sep 12 12:53:57.072701 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0339]
>> ava_mds_send
Sep 12 12:53:57.072706 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0690]
>> ava_mds_msg_async_send
Sep 12 12:53:57.072714 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0179]
>> ava_mds_cbk
Sep 12 12:53:57.072719 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0493]
>> ava_mds_flat_enc
Sep 12 12:53:57.072726 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0510]
<< ava_mds_flat_enc: retval = 1
Sep 12 12:53:57.072731 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0242]
TR MDS flat encode callback success
Sep 12 12:53:57.072736 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0316]
<< ava_mds_cbk
Sep 12 12:53:57.072754 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0715]
<< ava_mds_msg_async_send: retval = 1
Sep 12 12:53:57.072761 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0367]
TR AVA MDS send success
Sep 12 12:53:57.072766 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_mds.c:0369]
<< ava_mds_send
Sep 12 12:53:57.072785 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_api.c:1918]
TR Callback resonse completed
Sep 12 12:53:57.072791 osafntfd
[10192:../../../../../../../opensaf/osaf/libs/agents/saf/amfa/ava_api.c:1934]
<< saAmfResponse: rc:1
Sep 12 12:53:57.072797 osafntfd
[10192:../../../../../../../opensaf/osaf/services/saf/ntfsv/ntfs/ntfs_imcnutil.c:0180]
TR handle_state_ntfimcn: Terminating osafntfimcnd process
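The trace excerpt ends at the point where handle_state_ntfimcn() starts
terminating osafntfimcnd, consistent with the analysis above that the
terminate step never completes. Purely as an illustration of how such a wait
could be bounded (not a proposed patch; the function name and timeout are
invented), one could poll with WNOHANG and escalate to SIGKILL:

    #include <signal.h>
    #include <stdbool.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Illustrative only: wait at most timeout_sec for the child to exit,
     * then force it down so the caller cannot block past the AMF
     * healthcheck period. */
    static bool terminate_child_bounded(pid_t child_pid, unsigned timeout_sec)
    {
        kill(child_pid, SIGTERM);
        for (unsigned i = 0; i < timeout_sec * 10; i++) {
            if (waitpid(child_pid, NULL, WNOHANG) == child_pid)
                return true;        /* child has exited */
            usleep(100 * 1000);     /* poll every 100 ms */
        }
        kill(child_pid, SIGKILL);   /* escalate */
        return waitpid(child_pid, NULL, 0) == child_pid;
    }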
---
** [tickets:#1110] NTF healthcheck callback timedout leading to node reboot**
**Status:** unassigned
**Milestone:** 4.3.3
**Created:** Thu Sep 18, 2014 07:41 AM UTC by Sirisha Alla
**Last Updated:** Fri Sep 19, 2014 10:34 AM UTC
**Owner:** Praveen
This issue is a continuation of ticket #1109.
During failover, the node that rebooted failed to come back up because of
#1109. At about the same time, an NTF healthcheck callback timeout occurred on
the then-active controller, leading to a cluster reset.
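As background, a minimal sketch against the SAF AMF C API (the callback body
and handle variable are illustrative, not the NTF implementation): AMF invokes
the component's registered healthcheck callback periodically and expects
saAmfResponse() before the configured saAmfHealthcheckMaxDuration. If the
process's main thread is blocked and never calls saAmfDispatch(), the response
is never sent and AMF reports healthCheckcallbackTimeout, as in the syslog
below.

    #include <saAmf.h>

    static SaAmfHandleT amf_hdl; /* obtained from saAmfInitialize() elsewhere */

    /* Registered via SaAmfCallbacksT.saAmfHealthcheckCallback. */
    static void healthcheck_cb(SaInvocationT invocation,
                               const SaNameT *comp_name,
                               SaAmfHealthcheckKeyT *key)
    {
        (void)comp_name;
        (void)key;
        /* Never reached while the main thread is blocked: saAmfDispatch()
         * cannot run, so AMF times the healthcheck out and escalates. */
        saAmfResponse(amf_hdl, invocation, SA_AIS_OK);
    }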
Syslog of SC-2:
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafamfd[2391]: NO FAILOVER StandBy --> Active
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmd[2327]: NO ellect_coord invoke from
rda_callback ACTIVE
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmd[2327]: NO New coord elected, resides
at 2020f
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmnd[2337]: NO This IMMND is now the NEW
Coord
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmnd[2337]: NO PBE writing when new coord
elected => force PBE to regenerate db file
Sep 18 12:28:01 SLES-64BIT-SLOT2 osafimmnd[2337]: NO STARTING PBE process.
.....
Sep 18 12:28:11 SLES-64BIT-SLOT2 osafamfnd[2401]: NO Assigned
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-2,safSg=2N,safApp=OpenSAF'
Sep 18 12:28:21 SLES-64BIT-SLOT2 osafamfd[2391]: ER
sendStateChangeNotificationAvd: saNtfNotificationSend Failed (5)
Sep 18 12:28:31 SLES-64BIT-SLOT2 kernel: [ 111.656926] TIPC: Established link
<1.1.2:eth0-1.1.1:eth0> on network plane A
Sep 18 12:28:32 SLES-64BIT-SLOT2 osafimmd[2327]: NO New IMMND process is on
STANDBY Controller at 2010f
Sep 18 12:28:32 SLES-64BIT-SLOT2 osafimmd[2327]: NO Extended intro from node
2010f
.......
SC-1 rebooted because of #1109:
Sep 18 12:29:40 SLES-64BIT-SLOT2 osaffmd[2317]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId =
131599, SupervisionTime = 60
Sep 18 12:29:40 SLES-64BIT-SLOT2 kernel: [ 180.896027] TIPC: Resetting link
<1.1.2:eth0-1.1.1:eth0>, peer not responding
Sep 18 12:29:40 SLES-64BIT-SLOT2 kernel: [ 180.896032] TIPC: Lost link
<1.1.2:eth0-1.1.1:eth0> on network plane A
Sep 18 12:29:40 SLES-64BIT-SLOT2 kernel: [ 180.896034] TIPC: Lost contact with
<1.1.1>
Sep 18 12:29:40 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting remote node in the
absence of PLM is outside the scope of OpenSAF
Healthcheck callback timed out on NTF:
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO SU failover probation
timer started (timeout: 1200000000000 ns)
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO Performing failover of
'safSu=SC-2,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO
'safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF' recovery action escalated from
'componentFailover' to 'suFailover'
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: NO
'safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: ER
safComp=NTF,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due
to:healthCheckcallbackTimeout Recovery is:suFailover
Sep 18 12:33:54 SLES-64BIT-SLOT2 osafamfnd[2401]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: Component faulted: recovery is node failfast,
OwnNodeId = 131599, SupervisionTime = 60
Sep 18 12:33:54 SLES-64BIT-SLOT2 opensaf_reboot: Rebooting local node;
timeout=60
Sep 18 12:34:17 SLES-64BIT-SLOT2 syslog-ng[1139]: syslog-ng starting up;
version='2.0.9'
Sep 18 12:34:18 SLES-64BIT-SLOT2 ifup: lo
Syslog and MDS logs for both controllers are attached, along with NTFD traces
from SC-2.