- **status**: unassigned --> duplicate
- **Milestone**: 4.4.2 --> never
- **Comment**:
Closing this ticket as a duplicate of #1486: in both scenarios SMFD faulted
because of the following error.
ncs_sel_obj_create: socketpair failed - Too many open files
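
For readers unfamiliar with this error class: "Too many open files" is the text for errno EMFILE, which `socketpair()` returns once the process has reached its open-file limit. The sketch below is not smfd code. It only illustrates, under an artificially low limit, how a process that keeps creating socket pairs without closing them ends up with this failure. The function name and the limit of 64 are assumptions chosen for the illustration.

```python
import errno
import resource
import socket


def leak_socketpairs_until_failure(soft_limit=64):
    """Lower the soft RLIMIT_NOFILE, leak socket pairs until creation fails.

    Returns (pairs_created, errno_of_failure). The limit is restored and
    all leaked sockets are closed before returning.
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if old_hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, old_hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))

    leaked = []
    failure = None
    try:
        while True:
            # Each pair consumes two descriptors and is never closed here,
            # which simulates a descriptor leak.
            leaked.append(socket.socketpair())
    except OSError as exc:
        failure = exc.errno
    finally:
        for a, b in leaked:
            a.close()
            b.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))
    return len(leaked), failure


if __name__ == "__main__":
    pairs, err = leak_socketpairs_until_failure()
    print("pairs created before failure:", pairs)
    print("failed with EMFILE:", err == errno.EMFILE)
```

Whether smfd leaks one selection object per switchover or fails for another reason cannot be determined from the syslog alone.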
---
**[tickets:#1322] cluster went for reboot after a count of over 1000 switchovers.**
**Status:** duplicate
**Milestone:** never
**Created:** Tue Apr 21, 2015 06:26 AM UTC by Srikanth R
**Last Updated:** Tue Apr 21, 2015 06:26 AM UTC
**Owner:** nobody
**Attachments:**
- [logs.tgz](https://sourceforge.net/p/opensaf/tickets/1322/attachment/logs.tgz) (437.3 kB; application/x-compressed-tar)
Setup: OpenSAF configured with changeset 6377 on 5 nodes.
Issue: The cluster rebooted after more than 1000 switchovers.
Steps:
-> A script invokes switchovers continuously, with a gap of 45 seconds between
them.
-> After 1005 switchovers, the cluster rebooted because smfd crashed on both
controllers.
-> At the start of the last two switchovers, SC-2 is the active and SC-1 is
the standby.
On the new active, i.e. SC-1, smfd errors are observed, but there is no failure
from the switchover perspective:
Apr 21 06:55:39 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer connected:
11118 (safSmfService) <9602, 2010f>
Apr 21 06:55:39 SYSTEST-CNTLR-1 osafsmfd[4069]: ncs_sel_obj_create: socketpair
failed - Too many open files
Apr 21 06:55:39 SYSTEST-CNTLR-1 osafsmfd[4069]: ER SMFA: MBX create FAILED.
Apr 21 06:55:39 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer (applier)
connected: 11119 (@OpenSafImmReplicatorA) <9610, 2010f>
Apr 21 06:55:39 SYSTEST-CNTLR-1 osafntfimcnd[24404]: NO Started
Apr 21 06:55:39 SYSTEST-CNTLR-1 osafsmfd[4069]: ER
SmfCbkUtilThread::initSmfCbkApi saSmfInitialize failed, rc=SA_AIS_ERR_LIBRARY
(2)
Apr 21 06:55:39 SYSTEST-CNTLR-1 osafsmfd[4069]: ER initSmfCbkApi failed
Apr 21 06:55:39 SYSTEST-CNTLR-1 osafsmfd[4069]: ER SmfCbkUtilThread init
failed, terminating thread
Apr 21 06:55:39 SYSTEST-CNTLR-1 osafamfnd[4052]: NO Assigned
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Apr 21 06:55:39 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer disconnected
11109 <0, 2020f> (@OpenSafImmReplicatorB)
Apr 21 06:55:40 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer (applier)
connected: 11120 (@OpenSafImmReplicatorB) <0, 2020f>
Apr 21 06:55:40 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer disconnected
11111 <0, 2020f> (safAmfService)
Apr 21 06:55:40 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer (applier)
connected: 11121 (@safAmfService2020f) <0, 2020f>
Apr 21 06:55:40 SYSTEST-CNTLR-1 osafamfd[4042]: NO Switching StandBy --> Active
State
Apr 21 06:55:40 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer disconnected
11110 <18, 2010f> (@safAmfService2010f)
Apr 21 06:55:40 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer connected:
11122 (safAmfService) <18, 2010f>
Apr 21 06:55:40 SYSTEST-CNTLR-1 osafrded[3950]: NO RDE role set to ACTIVE
Apr 21 06:55:40 SYSTEST-CNTLR-1 osafclmd[4019]: NO ACTIVE request
Apr 21 06:55:40 SYSTEST-CNTLR-1 osafamfd[4042]: NO Assigning due to dep
'safSi=NOREDSI1,safApp=NOREDAPP'
Apr 21 06:55:41 SYSTEST-CNTLR-1 osafamfd[4042]: NO Assigning due to dep
'safSi=NPMSI1,safApp=NPMAPP'
Apr 21 06:55:41 SYSTEST-CNTLR-1 osafamfd[4042]: NO Assigning due to dep
'safSi=NWAYACTSI1,safApp=NWAYACTAPP'
Apr 21 06:55:41 SYSTEST-CNTLR-1 osafamfd[4042]: NO Assigning due to dep
'safSi=NWAYSI1,safApp=NWAYAPP'
Apr 21 06:55:41 SYSTEST-CNTLR-1 osafamfd[4042]: NO Assigning due to dep
'safSi=TWONSI1,safApp=TWONAPP'
Apr 21 06:55:41 SYSTEST-CNTLR-1 osafamfd[4042]: NO Controller switch over done
At this stage, SC-1 is the active and SC-2 is the standby. When the next
switchover was invoked, smfd on SC-2 (the new active) crashed and the node went
for reboot:
Apr 21 06:56:45 SYSTEST-CNTLR-2 osafimmnd[12333]: NO Implementer connected:
11127 (safEvtService) <684, 2020f>
Apr 21 06:56:45 SYSTEST-CNTLR-2 osafimmnd[12333]: NO Implementer connected:
11128 (safClmService) <5, 2020f>
Apr 21 06:56:45 SYSTEST-CNTLR-2 osafsmfd[12523]: ER amf_active_state_handler oi
activate FAIL
Apr 21 06:56:45 SYSTEST-CNTLR-2 osaflogd[12343]: WA read_logsv_configuration().
All attributes could not be read
Apr 21 06:56:45 SYSTEST-CNTLR-2 osafamfnd[12399]: NO
'safComp=SMF,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to
'csiSetcallbackFailed' : Recovery is 'nodeFailfast'
Apr 21 06:56:45 SYSTEST-CNTLR-2 osafamfnd[12399]: ER
safComp=SMF,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due
to:csiSetcallbackFailed Recovery is:nodeFailfast
Apr 21 06:56:45 SYSTEST-CNTLR-2 osafamfnd[12399]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: Component faulted: recovery is node failfast,
OwnNodeId = 131599, SupervisionTime = 60
Apr 21 06:56:45 SYSTEST-CNTLR-2 opensaf_reboot: Rebooting local node; timeout=60
The quiesced SC-1 could not promote itself to active: smfd crashed there as
well and the entire node went for reboot.
Apr 21 06:55:41 SYSTEST-CNTLR-1 osafamfd[4042]: NO Controller switch over done
Apr 21 06:56:22 SYSTEST-CNTLR-1 osafamfd[4042]: NO safSi=SC-2N,safApp=OpenSAF
Swap initiated
Apr 21 06:56:22 SYSTEST-CNTLR-1 osafamfnd[4052]: NO Assigning
'safSi=SC-2N,safApp=OpenSAF' QUIESCED to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Apr 21 06:56:22 SYSTEST-CNTLR-1 osafsmfd[4069]: ncs_sel_obj_create: socketpair
failed - Too many open files
Apr 21 06:56:22 SYSTEST-CNTLR-1 osafsmfd[4069]: NO immutil_saImmOmFinalize
fail, rc = 2, continue
Apr 21 06:56:22 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer disconnected
11112 <692, 2010f> (safMsgGrpService)
Apr 21 06:56:22 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer disconnected
11116 <694, 2010f> (safEvtService)
Apr 21 06:56:22 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer disconnected
11113 <4, 2010f> (safLogService)
Apr 21 06:56:23 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer disconnected
11114 <695, 2010f> (safCheckPointService)
Apr 21 06:56:23 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer disconnected
11115 <693, 2010f> (safLckService)
Apr 21 06:56:23 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer disconnected
11117 <12, 2010f> (safClmService)
Apr 21 06:56:24 SYSTEST-CNTLR-1 osafamfnd[4052]: NO Assigned
'safSi=SC-2N,safApp=OpenSAF' QUIESCED to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Apr 21 06:56:24 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer disconnected
11120 <0, 2020f> (@OpenSafImmReplicatorB)
Apr 21 06:56:24 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer connected:
11123 (safMsgGrpService) <0, 2020f>
Apr 21 06:56:24 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer connected:
11124 (safLogService) <0, 2020f>
Apr 21 06:56:24 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer connected:
11125 (safCheckPointService) <0, 2020f>
Apr 21 06:56:24 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer connected:
11126 (safLckService) <0, 2020f>
Apr 21 06:56:24 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer connected:
11127 (safEvtService) <0, 2020f>
Apr 21 06:56:24 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer connected:
11128 (safClmService) <0, 2020f>
Apr 21 06:56:25 SYSTEST-CNTLR-1 osafimmnd[3979]: NO Implementer (applier)
connected: 11129 (@OpenSafImmReplicatorB) <0, 2020f>
Apr 21 06:56:30 SYSTEST-CNTLR-1 osaffmd[3959]: NO Node Down event for node id
2020f:
Apr 21 06:56:30 SYSTEST-CNTLR-1 osaffmd[3959]: NO Current role: ACTIVE
Apr 21 06:56:30 SYSTEST-CNTLR-1 osaffmd[3959]: Rebooting OpenSAF NodeId =
131599 EE Name = , Reason: Received Node Down for peer controller, OwnNodeId =
131343, SupervisionTime = 60
Apr 21 06:56:30 SYSTEST-CNTLR-1 kernel: [45736.080046] TIPC: Resetting link
<1.1.1:eth1-1.1.2:eth2>, peer not responding
Apr 21 06:56:30 SYSTEST-CNTLR-1 osafamfnd[4052]: NO Assigning
'safSi=SC-2N,safApp=OpenSAF' ACTIVE to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Apr 21 06:56:30 SYSTEST-CNTLR-1 osafsmfd[4069]: ncs_sel_obj_create: socketpair
failed - Too many open files
Apr 21 06:56:30 SYSTEST-CNTLR-1 osafsmfd[4069]: ER immutil_saImmOiInitialize_2
fail, rc = 2
Apr 21 06:56:30 SYSTEST-CNTLR-1 osafsmfd[4069]: ER campaign_oi_init FAIL
Apr 21 06:56:30 SYSTEST-CNTLR-1 osafsmfd[4069]: ncs_sel_obj_create: socketpair
failed - Too many open files
Apr 21 06:56:30 SYSTEST-CNTLR-1 osafsmfd[4069]: ER Could not get IMM config
object from IMM opensafImm=opensafImm,safApp=safImmService
Apr 21 06:56:30 SYSTEST-CNTLR-1 osafsmfd[4069]: ER
read_IMM_long_DN_config_and_set_control_block FAIL
Apr 21 06:56:30 SYSTEST-CNTLR-1 osafsmfd[4069]: ER
read_config_and_set_control_block FAIL
Apr 21 06:56:30 SYSTEST-CNTLR-1 osafsmfd[4069]: ER amf_active_state_handler oi
activate FAIL
Apr 21 06:56:30 SYSTEST-CNTLR-1 osafamfnd[4052]: NO
'safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
'csiSetcallbackFailed' : Recovery is 'nodeFailfast'
Apr 21 06:56:30 SYSTEST-CNTLR-1 osafamfnd[4052]: ER
safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
to:csiSetcallbackFailed Recovery is:nodeFailfast
Apr 21 06:56:30 SYSTEST-CNTLR-1 osafamfnd[4052]: Rebooting OpenSAF NodeId =
131343 EE Name = , Reason: Component faulted: recovery is node failfast,
OwnNodeId = 131343, SupervisionTime = 60
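
A note on reading the excerpts above: the reboot messages print node ids in decimal, while the IMMND and FMD lines print them in hex. The conversion below is plain arithmetic and uses only the values that appear in the logs. The helper name is an assumption for illustration.

```python
def node_id_to_hex(node_id: int) -> str:
    """Format a decimal node id the way the IMMND/FMD lines show it."""
    return format(node_id, "x")


# Decimal ids taken from the "Rebooting OpenSAF NodeId = ..." lines.
for node_id in (131599, 131343):
    print(node_id, "->", node_id_to_hex(node_id))
```

So 131599 corresponds to 2020f, the node that FMD on SC-1 reports as down, and 131343 corresponds to 2010f, the OwnNodeId logged on SC-1.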
As both controllers rebooted, the payloads rebooted as well.
SMFD traces are not available; the syslogs of the controllers are attached.
IMMND traces are huge and can be shared on request.
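
Since SMFD traces are not available, one way to check on a rerun whether descriptors accumulate across switchovers is to sample the open-descriptor count of the osafsmfd process between iterations. The following is a minimal sketch, assuming a Linux host with `/proc` mounted and permission to read the target process's entries. The function names are hypothetical and the script is not part of OpenSAF.

```python
import os


def count_open_fds(pid="self"):
    """Count the entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_nofile_limit(pid="self"):
    """Read the soft 'Max open files' limit from /proc/<pid>/limits.

    Returns None if the line is missing or the value is not numeric.
    """
    with open(f"/proc/{pid}/limits") as fh:
        for line in fh:
            if line.startswith("Max open files"):
                value = line.split()[3]
                return int(value) if value.isdigit() else None
    return None


if __name__ == "__main__":
    if os.path.isdir("/proc/self/fd"):
        # Replace "self" with the osafsmfd pid when sampling a real node.
        print("open fds:", count_open_fds(), "soft limit:", soft_nofile_limit())
```

Logging these two numbers after every switchover would show whether the count climbs steadily toward the limit, which would be consistent with the socketpair failures in the syslog.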