I see this as a duplicate of #1291, which is closed as invalid.
The basic problem is communication overload.

The only currently available solution for deployments that see this issue is to
reduce the value of the opensafImmSyncBatchSize config attribute in the OpenSAF
IMM service object:
opensafImm=opensafImm,safApp=safImmService
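
For reference, the attribute can be inspected and lowered with the standard IMM
admin tools. The intent is that a smaller batch size makes the sync be sent as
more, but smaller, messages, which is gentler on MDS/TIPC. The value 4096 below
is only a placeholder, not a recommendation; check the current value first and
pick something smaller that suits the deployment:

immlist opensafImm=opensafImm,safApp=safImmService
immcfg -a opensafImmSyncBatchSize=4096 opensafImm=opensafImm,safApp=safImmService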

Beyond this, there are various enhancements, in MDS or OpenSAF, that could
potentially reduce the risk of communication overload.





---

** [tickets:#1514]  Opensaf on payload failed to come up and IMMD on active 
controller faulted**

**Status:** assigned
**Milestone:** 4.7.RC1
**Created:** Mon Oct 05, 2015 10:03 AM UTC by Ritu Raj
**Last Updated:** Wed Oct 07, 2015 12:37 PM UTC
**Owner:** A V Mahesh (AVM)
**Attachments:**

- 
[1513.tgz](https://sourceforge.net/p/opensaf/tickets/1514/attachment/1513.tgz) 
(7.1 MB; application/x-compressed-tar)


Setup:
Changeset: 6901
4 nodes configured with a single PBE and a load of 30K objects
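
(Side note for anyone reproducing this setup: a single shared PBE is normally
enabled roughly as sketched below. The imm.db file name matches what the PL-4
log reports further down; the config file location and exact steps are
assumptions and may differ per installation.)

# In immnd.conf (typically /etc/opensaf/immnd.conf) on the nodes that may host the PBE:
export IMMSV_PBE_FILE=imm.db

# Then switch the IMM repository to persistent mode:
immcfg -a saImmRepositoryInit=1 safRdn=immManagement,safApp=safImmService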

Issue observed:
* The payload failed to join the cluster, and later the active controller rebooted.

Steps performed:
* Started OpenSAF on controller SC-1, and SC-1 took the active role.

Oct  5 12:33:31 SLES-64BIT-SLOT1 osafrded[3129]: NO No peer available => 
Setting Active role for this node
* Later, started OpenSAF on slot-2, where opensafd failed because the disk was
full. Resolved the disk issue and restarted OpenSAF on slot-2, after which both
nodes joined the cluster.

Oct  5 12:45:34 SLES-32BIT-SLOT2 osafrded[15186]: NO Peer rde@2010f has active 
state => Assigning Standby role to this node


* After the controllers formed the cluster, started OpenSAF on the remaining
two payloads at the same time.
* PL-3 joined the cluster successfully:
Oct  5 13:03:19 SLES-64BIT-SLOT3 kernel: [495958.582544] TIPC: Own node address 
<1.1.3>, network identity 5234
....
Oct  5 13:09:34 SLES-64BIT-SLOT3 osafimmnd[15392]: NO NODE STATE-> 
IMM_NODE_FULLY_AVAILABLE 17601
Oct  5 13:09:34 SLES-64BIT-SLOT3 osafimmnd[15392]: NO Epoch set to 125 in 
ImmModel
Oct  5 13:09:35 SLES-64BIT-SLOT3 osafimmnd[15392]: NO Implementer (applier) 
connected: 27 (@OpenSafImmReplicatorB) <0, 2010f>


* PL-4 failed to join the cluster:

Oct  5 13:03:38 SLES-32BIT-SLOT4 kernel: [436326.659526] TIPC: Own node address 
<1.1.4>, network identity 5234
Oct  5 13:03:38 SLES-32BIT-SLOT4 osafimmnd[8781]: NO Persistent Back-End 
capability configured, Pbe file:imm.db (suffix may get added)
Oct  5 13:03:38 SLES-32BIT-SLOT4 osafimmnd[8781]: NO SERVER STATE: 
IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Oct  5 13:03:43 SLES-32BIT-SLOT4 osafimmnd[8781]: WA Resending introduce-me - 
problems with MDS ? 5
Oct  5 13:03:43 SLES-32BIT-SLOT4 osafimmnd[8781]: WA Resending introduce-me - 
problems with MDS ? 5
...
Oct  5 13:04:28 SLES-32BIT-SLOT4 osafimmnd[8781]: WA Resending introduce-me - 
problems with MDS ? 50
Oct  5 13:04:29 SLES-32BIT-SLOT4 osafimmnd[8781]: ER Failed to load/sync. 
Giving up after 51 seconds, restarting..
Oct  5 13:04:29 SLES-32BIT-SLOT4 opensafd[8736]: ER Failed   DESC:IMMND
Oct  5 13:04:29 SLES-32BIT-SLOT4 opensafd[8736]: ER Going for recovery
...
Oct  5 13:06:41 SLES-32BIT-SLOT4 osafimmnd[8856]: ER Failed to load/sync. Giving up after 51 seconds, restarting..
Oct  5 13:06:41 SLES-32BIT-SLOT4 opensafd[8736]: ER Could Not RESPAWN IMMND
Oct  5 13:06:41 SLES-32BIT-SLOT4 opensafd[8736]: ER Failed   DESC:IMMND
Oct  5 13:06:41 SLES-32BIT-SLOT4 opensafd[8736]: ER FAILED TO RESPAWN
Oct  5 13:06:41 SLES-32BIT-SLOT4 osafimmnd[8856]: ER IMMND - Periodic server 
job failed
Oct  5 13:06:41 SLES-32BIT-SLOT4 osafimmnd[8856]: ER Failed, exiting...
Oct  5 13:06:41 SLES-32BIT-SLOT4 kernel: [436509.187946] TIPC: Disabling bearer 
<eth:eth0>

* After opensafd failed to come up on PL-4, SC-1 rebooted because its IMMD
faulted (healthcheck callback timeout):

Oct  5 13:08:52 SLES-64BIT-SLOT1 osafimmnd[3163]: NO Coord broadcasting 
PBE_PRTO_PURGE_MUTATIONS, epoch:123
Oct  5 13:08:53 SLES-64BIT-SLOT1 osafimmnd[3163]: NO ImmModel::getPbeOi reports 
missing PbeOi locally => unsafe
Oct  5 13:08:53 SLES-64BIT-SLOT1 osafimmnd[3163]: NO Coord broadcasting 
PBE_PRTO_PURGE_MUTATIONS, epoch:123
Oct  5 13:08:53 SLES-64BIT-SLOT1 osafamfnd[3239]: NO SU failover probation 
timer started (timeout: 1200000000000 ns)
Oct  5 13:08:53 SLES-64BIT-SLOT1 osafamfnd[3239]: NO Performing failover of 
'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Oct  5 13:08:53 SLES-64BIT-SLOT1 osafamfnd[3239]: NO 
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated 
from 'componentFailover' to 'suFailover'
Oct  5 13:08:53 SLES-64BIT-SLOT1 osafamfnd[3239]: NO 
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Oct  5 13:08:53 SLES-64BIT-SLOT1 osafamfnd[3239]: ER 
safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due 
to:healthCheckcallbackTimeout Recovery is:suFailover


* PL-4 joined the cluster when opensafd was started on it again some time later.

