I see this as a duplicate of #1291, which is closed as invalid.
The basic problem is communication overload.
The only currently available solution for deployments that see this issue is to
reduce the value of the opensafImmSyncBatchSize config attribute in the OpenSAF
IMM service object:
opensafImm=opensafImm,safApp=safImmService
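As an illustrative sketch (the batch-size value of 128 below is an assumption;
choose a value suited to the deployment), the attribute can be changed with the
immcfg tool:

immcfg -a opensafImmSyncBatchSize=128 opensafImm=opensafImm,safApp=safImmService

The current value can be inspected with:

immlist opensafImm=opensafImm,safApp=safImmService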
Beyond this, there are various enhancements in MDS or OpenSAF that could
potentially reduce the risk of communication overload.
---
** [tickets:#1514] Opensaf on payload failed to come up and IMMD on active
controller faulted**
**Status:** assigned
**Milestone:** 4.7.RC1
**Created:** Mon Oct 05, 2015 10:03 AM UTC by Ritu Raj
**Last Updated:** Wed Oct 07, 2015 12:37 PM UTC
**Owner:** A V Mahesh (AVM)
**Attachments:**
-
[1513.tgz](https://sourceforge.net/p/opensaf/tickets/1514/attachment/1513.tgz)
(7.1 MB; application/x-compressed-tar)
Setup:
Changeset: 6901
4 nodes configured with a single PBE and a load of 30K objects

Issue observed:
* Payload failed to join the cluster and later the active controller rebooted

Steps performed:
* Started OpenSAF on controller SC-1, and SC-1 took the active role.
Oct 5 12:33:31 SLES-64BIT-SLOT1 osafrded[3129]: NO No peer available =>
Setting Active role for this node
Later, started OpenSAF on slot-2, where opensafd initially failed because the
disk was full. Resolved the disk issue and restarted OpenSAF on slot-2, after
which both controllers joined the cluster.
Oct 5 12:45:34 SLES-32BIT-SLOT2 osafrded[15186]: NO Peer rde@2010f has active
state => Assigning Standby role to this node
* After the controllers formed the cluster, started OpenSAF on the remaining
two payloads at the same time.
* PL-3 joined the cluster successfully:
Oct 5 13:03:19 SLES-64BIT-SLOT3 kernel: [495958.582544] TIPC: Own node address
<1.1.3>, network identity 5234
....
Oct 5 13:09:34 SLES-64BIT-SLOT3 osafimmnd[15392]: NO NODE STATE->
IMM_NODE_FULLY_AVAILABLE 17601
Oct 5 13:09:34 SLES-64BIT-SLOT3 osafimmnd[15392]: NO Epoch set to 125 in
ImmModel
Oct 5 13:09:35 SLES-64BIT-SLOT3 osafimmnd[15392]: NO Implementer (applier)
connected: 27 (@OpenSafImmReplicatorB) <0, 2010f>
* PL-4 failed to join the cluster:
Oct 5 13:03:38 SLES-32BIT-SLOT4 kernel: [436326.659526] TIPC: Own node address
<1.1.4>, network identity 5234
Oct 5 13:03:38 SLES-32BIT-SLOT4 osafimmnd[8781]: NO Persistent Back-End
capability configured, Pbe file:imm.db (suffix may get added)
Oct 5 13:03:38 SLES-32BIT-SLOT4 osafimmnd[8781]: NO SERVER STATE:
IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Oct 5 13:03:43 SLES-32BIT-SLOT4 osafimmnd[8781]: WA Resending introduce-me -
problems with MDS ? 5
Oct 5 13:03:43 SLES-32BIT-SLOT4 osafimmnd[8781]: WA Resending introduce-me -
problems with MDS ? 5
...
Oct 5 13:04:28 SLES-32BIT-SLOT4 osafimmnd[8781]: WA Resending introduce-me -
problems with MDS ? 50
Oct 5 13:04:29 SLES-32BIT-SLOT4 osafimmnd[8781]: ER Failed to load/sync.
Giving up after 51 seconds, restarting..
Oct 5 13:04:29 SLES-32BIT-SLOT4 opensafd[8736]: ER Failed DESC:IMMND
Oct 5 13:04:29 SLES-32BIT-SLOT4 opensafd[8736]: ER Going for recovery
...
Oct 5 13:06:41 SLES-32BIT-SLOT4 osafimmnd[8856]: ER Failed to load/sync.
Giving up after 51 seconds, restarting..
Oct 5 13:06:41 SLES-32BIT-SLOT4 opensafd[8736]: ER Could Not RESPAWN IMMND
Oct 5 13:06:41 SLES-32BIT-SLOT4 opensafd[8736]: ER Failed DESC:IMMND
Oct 5 13:06:41 SLES-32BIT-SLOT4 opensafd[8736]: ER FAILED TO RESPAWN
Oct 5 13:06:41 SLES-32BIT-SLOT4 osafimmnd[8856]: ER IMMND - Periodic server
job failed
Oct 5 13:06:41 SLES-32BIT-SLOT4 osafimmnd[8856]: ER Failed, exiting...
Oct 5 13:06:41 SLES-32BIT-SLOT4 kernel: [436509.187946] TIPC: Disabling bearer
<eth:eth0>
* After opensafd failed to come up on PL-4, SC-1 rebooted with IMMD
exiting.
Oct 5 13:08:52 SLES-64BIT-SLOT1 osafimmnd[3163]: NO Coord broadcasting
PBE_PRTO_PURGE_MUTATIONS, epoch:123
Oct 5 13:08:53 SLES-64BIT-SLOT1 osafimmnd[3163]: NO ImmModel::getPbeOi reports
missing PbeOi locally => unsafe
Oct 5 13:08:53 SLES-64BIT-SLOT1 osafimmnd[3163]: NO Coord broadcasting
PBE_PRTO_PURGE_MUTATIONS, epoch:123
Oct 5 13:08:53 SLES-64BIT-SLOT1 osafamfnd[3239]: NO SU failover probation
timer started (timeout: 1200000000000 ns)
Oct 5 13:08:53 SLES-64BIT-SLOT1 osafamfnd[3239]: NO Performing failover of
'safSu=SC-1,safSg=2N,safApp=OpenSAF' (SU failover count: 1)
Oct 5 13:08:53 SLES-64BIT-SLOT1 osafamfnd[3239]: NO
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' recovery action escalated
from 'componentFailover' to 'suFailover'
Oct 5 13:08:53 SLES-64BIT-SLOT1 osafamfnd[3239]: NO
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to
'healthCheckcallbackTimeout' : Recovery is 'suFailover'
Oct 5 13:08:53 SLES-64BIT-SLOT1 osafamfnd[3239]: ER
safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due
to:healthCheckcallbackTimeout Recovery is:suFailover
* PL-4 eventually joined the cluster after opensafd was started on it again some
time later.