We are using OpenSAF 4.4.0. In one of our environments a cluster was up and running, and then each node in the cluster was stopped and started with the opensafd script. All nodes restarted and rejoined the cluster without errors except for one payload node (appbox4). That payload node would not rejoin the cluster, even after the appbox4 machine was rebooted. Each attempt to start the node resulted in the OpenSAF processes going defunct and hanging until the opensafd stop command was executed. The resolution was to stop and start the entire cluster. That is not a good solution for a continuously available system, so we would like to know the following:
1) What could possibly cause this problem?
2) Is there another way of resolving this situation other than stopping and starting the entire cluster?

We really would appreciate any suggestions or help with this issue.

The payload, active controller, and standby controller messages logs contain the following for one such start attempt:

Payload messages log:

Feb 6 12:54:22 appbox4 opensafd: Starting OpenSAF Services
Feb 6 12:54:22 appbox4 osafdtmd[63170]: Started
Feb 6 12:54:22 appbox4 osafimmnd[63192]: Started
Feb 6 12:54:22 appbox4 osafdtmd[63170]: NO Established contact with 'appbox1'
Feb 6 12:54:22 appbox4 osafdtmd[63170]: NO Established contact with 'dbbox2'
Feb 6 12:54:22 appbox4 osafdtmd[63170]: NO Established contact with 'appbox3'
Feb 6 12:54:22 appbox4 osafdtmd[63170]: NO Established contact with 'dbbox1'
Feb 6 12:54:22 appbox4 osafdtmd[63170]: NO Established contact with 'appbox2'
Feb 6 12:54:22 appbox4 osafimmnd[63192]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Feb 6 12:54:23 appbox4 osafimmnd[63192]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Feb 6 12:54:23 appbox4 osafimmnd[63192]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Feb 6 12:54:23 appbox4 osafimmnd[63192]: NO NODE STATE-> IMM_NODE_ISOLATED
Feb 6 12:54:23 appbox4 osafimmnd[63192]: NO NODE STATE-> IMM_NODE_W_AVAILABLE
Feb 6 12:54:23 appbox4 osafimmnd[63192]: NO SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
Feb 6 12:54:25 appbox4 osafimmnd[63192]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 2316
Feb 6 12:54:25 appbox4 osafimmnd[63192]: NO RepositoryInitModeT is SA_IMM_INIT_FROM_FILE
Feb 6 12:54:25 appbox4 osafimmnd[63192]: NO Epoch set to 27 in ImmModel
Feb 6 12:54:25 appbox4 osafimmnd[63192]: NO SERVER STATE: IMM_SERVER_SYNC_CLIENT --> IMM SERVER READY
Feb 6 12:54:25 appbox4 osafclmna[63225]: Started
Feb 6 12:54:25 appbox4 osafclmna[63225]: NO safNode=appbox4,safCluster=myClmCluster Joined cluster, nodeid=20d0f
Feb 6 12:54:25 appbox4 osafamfnd[63240]: Started

This is where things hang and the OpenSAF processes go defunct on appbox4, until the opensafd stop command was executed:

Feb 6 12:57:57 appbox4 opensafd: Stopping OpenSAF Services

We are not sure whether it is significant, but the last messages logged while the Node Director was still OK are:

Feb 6 10:16:39 appbox4 osafamfnd[13603]: ER ncsmds_api for 0 FAILED, dest=20d0f0000acd4
Feb 6 10:16:49 appbox4 osafamfnd[13603]: saImmOmInitialize FAILED, rc = 5

Active controller messages log:

Feb 6 12:54:22 appbox3 osafdtmd[22105]: NO Established contact with 'appbox4'
Feb 6 12:54:23 appbox3 osafimmd[22159]: NO Node 20d0f request sync sync-pid:63192 epoch:0
Feb 6 12:54:23 appbox3 osafimmnd[22175]: NO Announce sync, epoch:27
Feb 6 12:54:23 appbox3 osafimmnd[22175]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
Feb 6 12:54:23 appbox3 osafimmd[22159]: NO Successfully announced sync. New ruling epoch:27
Feb 6 12:54:23 appbox3 osafimmnd[22175]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Feb 6 12:54:23 appbox3 osafimmloadd: NO Sync starting
Feb 6 12:54:25 appbox3 osafimmloadd: IN Synced 3291 objects in total
Feb 6 12:54:25 appbox3 osafimmnd[22175]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 15141
Feb 6 12:54:25 appbox3 osafimmloadd: NO Sync ending normally
Feb 6 12:54:25 appbox3 osafimmnd[22175]: NO Epoch set to 27 in ImmModel
Feb 6 12:54:25 appbox3 osafimmd[22159]: NO ACT: New Epoch for IMMND process at node 20b0f old epoch: 26 new epoch:27
Feb 6 12:54:25 appbox3 osafimmd[22159]: NO ACT: New Epoch for IMMND process at node 20e0f old epoch: 26 new epoch:27
Feb 6 12:54:25 appbox3 osafimmnd[22175]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM SERVER READY
Feb 6 12:54:25 appbox3 osafimmd[22159]: NO ACT: New Epoch for IMMND process at node 20f0f old epoch: 26 new epoch:27
Feb 6 12:54:25 appbox3 osafimmd[22159]: NO ACT: New Epoch for IMMND process at node 20a0f old epoch: 26 new epoch:27
Feb 6 12:54:25 appbox3 osafimmd[22159]: NO ACT: New Epoch for IMMND process at node 20c0f old epoch: 26 new epoch:27
Feb 6 12:54:25 appbox3 osafimmd[22159]: NO ACT: New Epoch for IMMND process at node 20d0f old epoch: 0 new epoch:27
Feb 6 12:54:25 appbox3 osafamfd[22260]: NO Node 'appbox4' joined the cluster

Standby controller messages log:

Feb 6 12:54:22 appbox1 osafdtmd[14345]: NO Established contact with 'appbox4'
Feb 6 12:54:23 appbox1 osafimmd[14398]: NO SBY: Ruling epoch noted as:27
Feb 6 12:54:23 appbox1 osafimmd[14398]: NO IMMND coord at 20b0f
Feb 6 12:54:23 appbox1 osafimmnd[14414]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
Feb 6 12:54:25 appbox1 osafimmnd[14414]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 15642
Feb 6 12:54:25 appbox1 osafimmnd[14414]: NO Epoch set to 27 in ImmModel
Feb 6 12:54:25 appbox1 osafimmd[14398]: NO SBY: New Epoch for IMMND process at node 20b0f old epoch: 26 new epoch:27
Feb 6 12:54:25 appbox1 osafimmd[14398]: NO IMMND coord at 20b0f
Feb 6 12:54:25 appbox1 osafimmd[14398]: NO SBY: New Epoch for IMMND process at node 20e0f old epoch: 26 new epoch:27
Feb 6 12:54:25 appbox1 osafimmd[14398]: NO SBY: New Epoch for IMMND process at node 20f0f old epoch: 26 new epoch:27
Feb 6 12:54:25 appbox1 osafimmd[14398]: NO SBY: New Epoch for IMMND process at node 20a0f old epoch: 26 new epoch:27
Feb 6 12:54:25 appbox1 osafimmd[14398]: NO SBY: New Epoch for IMMND process at node 20c0f old epoch: 26 new epoch:27
Feb 6 12:54:25 appbox1 osafimmd[14398]: NO SBY: New Epoch for IMMND process at node 20d0f old epoch: 0 new epoch:27

Shu Wang | Senior Analyst | +1(407)708-5117 or x3917 | www.NetCracker.com

_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
