Sounds like SMFD or a library linked into it is leaking file descriptors...
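
One rough way to check for a leak is to watch the process's descriptor count across switchovers. The snippet below is only a sketch, assuming a Linux host with `/proc` mounted; the helper names `count_open_fds` and `fd_growth` are hypothetical and not part of OpenSAF. A count that keeps rising after every switchover and never drops back would point to a leak.

```python
import os


def count_open_fds(pid="self"):
    """Return the number of entries in /proc/<pid>/fd (Linux only).

    For pid="self" the listing includes the descriptor that os.listdir()
    opens for the directory itself, so the absolute value is off by one,
    but differences between two calls are still exact.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_growth(action, rounds=5, pid="self"):
    """Call action() `rounds` times and record the fd count after each call.

    `action` is a placeholder for whatever triggers the suspected leak,
    e.g. a wrapper that invokes one switchover.
    """
    counts = [count_open_fds(pid)]
    for _ in range(rounds):
        action()
        counts.append(count_open_fds(pid))
    return counts
```

To inspect the running daemon instead of the current process, pass its PID (for example the osafsmfd PID shown in syslog) as `pid`. Reading another user's `/proc/<pid>/fd` normally requires root.
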
---
**[tickets:#1486] smf: SMFD asserted in csi active callback during switchovers (ncs_sel_obj_create: socketpair failed)**
**Status:** unassigned
**Milestone:** 4.7.2
**Created:** Wed Sep 16, 2015 10:04 AM UTC by Ritu Raj
**Last Updated:** Fri Sep 16, 2016 08:31 AM UTC
**Owner:** nobody
Setup:
4.6GA with changeset 6490
4 nodes (OEL6.4 with TIPC version 1.7.7), with no PBE configured
Issues Observed:
> Cluster rebooted during switchover because SMFD faulted due to 'csiSetcallbackFailed'
Steps Performed:
* Continuous switchovers were invoked on the setup.
* After more than 1000 switchovers, the standby controller (SC-2) was
rebooted while it was being promoted to the ACTIVE state, because SMFD failed
in the active callback.
Sep 16 06:25:00 SLOT-2 osafsmfd[1926]: ER amf_active_state_handler oi activate
FAIL
Sep 16 06:25:00 SLOT-2 osafamfnd[1802]: NO
'safComp=SMF,safSu=SC-2,safSg=2N,safApp=OpenSAF' faulted due to
'csiSetcallbackFailed' : Recovery is 'nodeFailfast'
Sep 16 06:25:00 SLOT-2 osafamfnd[1802]: ER
safComp=SMF,safSu=SC-2,safSg=2N,safApp=OpenSAF Faulted due
to:csiSetcallbackFailed Recovery is:nodeFailfast
Sep 16 06:25:00 SLOT-2 osafamfnd[1802]: Rebooting OpenSAF NodeId = 131599 EE
Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId =
131599, SupervisionTime = 60
* After SC-2 rebooted, SC-1 tried to become active again, and smfd also
faulted on this newly re-promoted active controller.
Sep 16 06:25:00 SLOT-1 root: Invoking switchover from invoke_switchover.sh
Sep 16 06:25:00 SLOT-1 osafamfd[3830]: NO safSi=SC-2N,safApp=OpenSAF Swap
initiated
Sep 16 06:25:00 SLOT-1 osafamfnd[3845]: NO Assigning
'safSi=SC-2N,safApp=OpenSAF' QUIESCED to 'safSu=SC-1,safSg=2N,safApp=OpenSAF'
Sep 16 06:25:00 SLOT-1 osafsmfd[3871]: ncs_sel_obj_create: socketpair failed -
Too many open files
....
Sep 16 06:25:05 SLOT-1 kernel: TIPC: Resetting link <1.1.1:eth0-1.1.2:eth1>,
peer not responding
Sep 16 06:25:05 SLOT-1 kernel: TIPC: Lost link <1.1.1:eth0-1.1.2:eth1> on
network plane A
Sep 16 06:25:05 SLOT-1 kernel: TIPC: Lost contact with <1.1.2>
Sep 16 06:25:05 SLOT-1 osaffmd[3716]: NO Node Down event for node id 2020f:
....
Sep 16 06:25:06 SLOT-1 osafimmnd[3746]: NO This IMMND re-elected coord
redundantly, failover ?
Sep 16 06:25:06 SLOT-1 osafsmfd[3871]: ncs_sel_obj_create: socketpair failed -
Too many open files
Sep 16 06:25:06 SLOT-1 osafsmfd[3871]: ER immutil_saImmOiInitialize_2 fail, rc
= 2
...
Sep 16 06:25:06 SLOT-1 osafamfnd[3845]: ER
safComp=SMF,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
to:csiSetcallbackFailed Recovery is:nodeFailfast
Sep 16 06:25:06 SLOT-1 osafamfnd[3845]: Rebooting OpenSAF NodeId = 131343 EE
Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId =
131343, SupervisionTime = 60
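
The "socketpair failed - Too many open files" lines above correspond to errno EMFILE, which a process gets once it reaches its RLIMIT_NOFILE limit. The following is only an illustration of that failure mode, not SMFD code: it lowers the soft limit for the current process, deliberately leaks socket pairs, and reports where `socketpair` starts failing. The function name and the limit of 64 are assumptions chosen for the demo.

```python
import errno
import resource
import socket


def leak_socketpairs_until_failure(limit=64):
    """Leak socket pairs under a lowered fd limit.

    Returns (pairs_created, errno_of_failure). The original soft limit
    is restored and all leaked sockets are closed before returning.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never try to raise the soft limit; only lower it for the demo.
    lowered = limit if soft == resource.RLIM_INFINITY else min(limit, soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (lowered, hard))
    leaked = []
    try:
        while True:
            # Each pair consumes two descriptors and is never closed here.
            leaked.append(socket.socketpair())
    except OSError as exc:
        return len(leaked), exc.errno
    finally:
        for a, b in leaked:
            a.close()
            b.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))


if __name__ == "__main__":
    pairs, err = leak_socketpairs_until_failure()
    print(pairs, errno.errorcode.get(err), "-", "see strerror for text")
```

If a daemon creates such a pair on every role change without closing the previous one, it would take many switchovers to hit the limit, which would be consistent with the failure only appearing after more than 1000 iterations. That is a hypothesis to verify, not a confirmed root cause.
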