Thanks Neel. Is this fix in 4.3.2?

On Mar 5, 2014, at 7:42 AM, Neelakanta Reddy <[email protected]> wrote:

Hi,

A similar problem is fixed in http://sourceforge.net/p/opensaf/tickets/600/. The patch is pushed in changeset 4688 for 4.3.x. Apply the patch and retest. If you still see the problem, please share the following logs:

1. amfd and amfnd traces of the controllers and the payload
2. syslog of the controllers and the payload
3. mds.log for the controllers and the payload

/Neel.

On Wednesday 05 March 2014 05:25 PM, Tony Hart wrote:

5 seconds.

The payload card gets the TIPC timeout logs, but it does not reboot. This may be timing related, since the link re-establishes quickly after going down (you can see from the logs that the link re-established within the same second of going down).

On Mar 5, 2014, at 6:51 AM, Neelakanta Reddy <[email protected]> wrote:

Hi,

What is the configured TIPC link tolerance time? Depending on the tolerance time, the other node will get service down.

/Neel.

On Tuesday 04 March 2014 08:53 PM, Tony Hart wrote:

We’re seeing a problem where there is a loss of connectivity between a payload (cmm02B) and the controller. The connectivity returns, but it is away just long enough to trigger a TIPC timeout. In this case the payload is dropped from the cluster, but the payload doesn’t restart. The payload is flagged as not being in the cluster and its presence state is UNINSTANTIATED. It’s still running the osaf processes, though.
Is this something that’s been fixed in the current release (we’re running 4.3.1)?

$ immlist safNode=cmm02b,safCluster=myClmCluster
Name                                    Type         Value(s)
========================================================================
safNode                                 SA_STRING_T  safNode=cmm02b
saClmNodeLockCallbackTimeout            SA_TIME_T    50000000000 (0xba43b7400, Thu Jan 1 00:00:50 1970)
saClmNodeIsMember                       SA_UINT32_T  0 (0x0)
saClmNodeInitialViewNumber              SA_UINT64_T  28 (0x1c)
saClmNodeID                             SA_UINT32_T  73743 (0x1200f)
saClmNodeEE                             SA_NAME_T    <Empty>
saClmNodeDisableReboot                  SA_UINT32_T  0 (0x0)
saClmNodeCurrAddressFamily              SA_UINT32_T  <Empty>
saClmNodeCurrAddress                    SA_STRING_T  <Empty>
saClmNodeBootTimeStamp                  SA_TIME_T    1393879646000000000 (0x13580e27277a2c00, Mon Mar 3 20:47:26 2014)
saClmNodeAdminState                     SA_UINT32_T  1 (0x1)
saClmNodeAddressFamily                  SA_UINT32_T  <Empty>
saClmNodeAddress                        SA_STRING_T  <Empty>
SaImmAttrImplementerName                SA_STRING_T  safClmService
SaImmAttrClassName                      SA_STRING_T  SaClmNode
SaImmAttrAdminOwnerName                 SA_STRING_T  IMMLOADER

$ immlist $(amf-find node | grep CMM02B)
Name                                    Type         Value(s)
========================================================================
safAmfNode                              SA_STRING_T  safAmfNode=CMM02B
saAmfNodeSuFailoverMax                  SA_UINT32_T  2 (0x2)
saAmfNodeSuFailOverProb                 SA_TIME_T    1200000000000 (0x1176592e000, Thu Jan 1 00:20:00 1970)
saAmfNodeOperState                      SA_UINT32_T  2 (0x2)
saAmfNodeFailfastOnTerminationFailure   SA_UINT32_T  0 (0x0)
saAmfNodeFailfastOnInstantiationFailure SA_UINT32_T  0 (0x0)
saAmfNodeClmNode                        SA_NAME_T    safNode=cmm02b,safCluster=myClmCluster (38)
saAmfNodeCapacity                       SA_STRING_T  <Empty>
saAmfNodeAutoRepair                     SA_UINT32_T  1 (0x1)
saAmfNodeAdminState                     SA_UINT32_T  1 (0x1)
SaImmAttrImplementerName                SA_STRING_T  safAmfService
SaImmAttrClassName                      SA_STRING_T  SaAmfNode
SaImmAttrAdminOwnerName                 SA_STRING_T  IMMLOADER

cmm02b$ ps aux | grep osaf
root     1417  0.0  0.0 225880  2028 ?  Ssl  Mar03  0:08 /usr/lib64/opensaf/osafamfnd osafamfnd
root     1429  0.0  0.0 157100  1416 ?  Ssl  Mar03  0:00 /usr/lib64/opensaf/osafsmfnd osafsmfnd
opensaf  1438  0.0  0.1 174256  5764 ?  Ssl  Mar03  0:00 /usr/lib64/opensaf/osafmsgnd osafmsgnd
opensaf  1454  0.0  0.0 155732  1448 ?  Ssl  Mar03  0:00 /usr/lib64/opensaf/osaflcknd osaflcknd
opensaf  1463  0.0  0.0 158148  2296 ?  Ssl  Mar03  0:00 /usr/lib64/opensaf/osafckptnd osafckptnd
opensaf  1472  0.0  0.0 155020  1392 ?  Ssl  Mar03  0:02 /usr/lib64/opensaf/osafamfwd osafamfwd
opensaf  4704  0.0  0.3 182240 11992 ?  Ssl  14:20  0:01 /usr/lib64/opensaf/osafimmnd osafimmnd

SCM1 (1.1.15)
-------------------
2014-03-04T14:20:18.808187+00:00 scm1 osafamfd[1771]: NO Node 'PLD0211' left the cluster
2014-03-04T14:20:18.851318+00:00 scm1 kernel: TIPC: Established link <1.1.15:eth2-1.1.27:bond0> on network plane A
2014-03-04T14:20:18.852749+00:00 scm1 osafsmfd[1965]: ER saClmClusterNodeGet failed, rc=SA_AIS_ERR_NOT_EXIST (12)
2014-03-04T14:20:18.858472+00:00 scm1 kernel: TIPC: Established link <1.1.15:eth2-1.1.23:bond0> on network plane A
2014-03-04T14:20:18.871084+00:00 scm1 osafsmfd[1965]: ER saClmClusterNodeGet failed, rc=SA_AIS_ERR_NOT_EXIST (12)
2014-03-04T14:20:18.956307+00:00 scm1 kernel: TIPC: Resetting link <1.1.15:eth2-1.1.32:eth2>, peer not responding
2014-03-04T14:20:18.956330+00:00 scm1 kernel: TIPC: Lost link <1.1.15:eth2-1.1.32:eth2> on network plane A
2014-03-04T14:20:18.956335+00:00 scm1 kernel: TIPC: Lost contact with <1.1.32>
2014-03-04T14:20:18.956340+00:00 scm1 kernel: TIPC: Established link <1.1.15:eth2-1.1.32:eth2> on network plane A
2014-03-04T14:20:18.958227+00:00 scm1 osafimmnd[1667]: NO Global discard node received for nodeId:1200f pid:1347
2014-03-04T14:20:18.958270+00:00 scm1 osafimmnd[1667]: NO Implementer disconnected 51 <0, 1200f(down)> (MsgQueueService73743)
2014-03-04T14:20:18.965240+00:00 scm1 osafimmnd[1667]: NO Implementer connected: 71 (MsgQueueService73743) <92377, 10f0f>
2014-03-04T14:20:18.968251+00:00 scm1 osafimmnd[1667]: NO Implementer locally disconnected. Marking it as doomed 71 <92377, 10f0f> (MsgQueueService73743)
2014-03-04T14:20:18.971785+00:00 scm1 osafimmnd[1667]: NO Global discard node received for nodeId:1170f pid:0
2014-03-04T14:20:18.973013+00:00 scm1 osafimmnd[1667]: NO Global discard node received for nodeId:1200f pid:0
2014-03-04T14:20:18.976586+00:00 scm1 osafimmnd[1667]: NO Implementer disconnected 71 <92377, 10f0f> (MsgQueueService73743)
2014-03-04T14:20:19.025760+00:00 scm1 osafimmd[1657]: NO Node 11e0f request sync sync-pid:23769 epoch:0
2014-03-04T14:20:19.076427+00:00 scm1 osafamfd[1771]: NO Node 'PLD0214' left the cluster
2014-03-04T14:20:19.215220+00:00 scm1 osafimmd[1657]: NO Node 11c0f request sync sync-pid:23629 epoch:0
2014-03-04T14:20:19.296817+00:00 scm1 osafamfd[1771]: WA avd_msg_sanity_chk: invalid node ID (11e0f)
2014-03-04T14:20:19.300899+00:00 scm1 osafamfd[1771]: WA avd_msg_sanity_chk: invalid node ID (11e0f)
2014-03-04T14:20:19.305377+00:00 scm1 osafamfd[1771]: NO Node 'CMM02B' left the cluster
2014-03-04T14:20:19.357458+00:00 scm1 osafimmd[1657]: NO Node 1200f request sync sync-pid:4704 epoch:0

cmm02B (1.1.32)
-----------------------
2014-03-04T14:20:18.495174+00:00 cmm02b kernel: TIPC: Resetting link <1.1.32:eth2-1.1.10:bond0>, peer not responding
2014-03-04T14:20:18.495203+00:00 cmm02b kernel: TIPC: Lost link <1.1.32:eth2-1.1.10:bond0> on network plane A
2014-03-04T14:20:18.495209+00:00 cmm02b kernel: TIPC: Lost contact with <1.1.10>
2014-03-04T14:20:18.501981+00:00 cmm02b kernel: TIPC: Resetting link <1.1.32:eth2-1.1.15:eth2>, peer not responding
2014-03-04T14:20:18.502012+00:00 cmm02b kernel: TIPC: Lost link <1.1.32:eth2-1.1.15:eth2> on network plane A
2014-03-04T14:20:18.502016+00:00 cmm02b kernel: TIPC: Lost contact with <1.1.15>
2014-03-04T14:20:18.502020+00:00 cmm02b kernel: TIPC: Resetting link <1.1.32:eth2-1.1.11:bond0>, peer not responding
2014-03-04T14:20:18.502023+00:00 cmm02b kernel: TIPC: Lost link <1.1.32:eth2-1.1.11:bond0> on network plane A
2014-03-04T14:20:18.502026+00:00 cmm02b kernel: TIPC: Lost contact with <1.1.11>
2014-03-04T14:20:18.502110+00:00 cmm02b kernel: TIPC: Resetting link <1.1.32:eth2-1.1.1:bond0>, peer not responding
2014-03-04T14:20:18.502115+00:00 cmm02b kernel: TIPC: Lost link <1.1.32:eth2-1.1.1:bond0> on network plane A
2014-03-04T14:20:18.502118+00:00 cmm02b kernel: TIPC: Lost contact with <1.1.1>
2014-03-04T14:20:18.549154+00:00 cmm02b kernel: TIPC: Resetting link <1.1.32:eth2-1.1.14:bond0>, peer not responding
2014-03-04T14:20:18.549180+00:00 cmm02b kernel: TIPC: Lost link <1.1.32:eth2-1.1.14:bond0> on network plane A
2014-03-04T14:20:18.549184+00:00 cmm02b kernel: TIPC: Lost contact with <1.1.14>
2014-03-04T14:20:18.671107+00:00 cmm02b kernel: TIPC: Established link <1.1.32:eth2-1.1.14:bond0> on network plane A
2014-03-04T14:20:18.743482+00:00 cmm02b kernel: TIPC: Established link <1.1.32:eth2-1.1.11:bond0> on network plane A
2014-03-04T14:20:18.866277+00:00 cmm02b kernel: TIPC: Established link <1.1.32:eth2-1.1.10:bond0> on network plane A
2014-03-04T14:20:18.869280+00:00 cmm02b kernel: TIPC: Established link <1.1.32:eth2-1.1.1:bond0> on network plane A
2014-03-04T14:20:18.954740+00:00 cmm02b kernel: TIPC: Established link <1.1.32:eth2-1.1.15:eth2> on network plane A
2014-03-04T14:20:18.959226+00:00 cmm02b osafimmnd[1347]: WA MESSAGE:38632 OUT OF ORDER my highest processed:38600, exiting
2014-03-04T14:20:18.967269+00:00 cmm02b osafamfnd[1417]: NO 'safComp=IMMND,safSu=CMM02B,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
2014-03-04T14:20:19.052569+00:00 cmm02b osafimmnd[4704]: Started
2014-03-04T14:20:19.157393+00:00 cmm02b osafimmnd[4704]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
2014-03-04T14:20:19.257835+00:00 cmm02b osafimmnd[4704]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
2014-03-04T14:20:19.358134+00:00 cmm02b osafimmnd[4704]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
2014-03-04T14:20:19.358452+00:00 cmm02b osafimmnd[4704]: NO NODE STATE-> IMM_NODE_ISOLATED
2014-03-04T14:20:19.955686+00:00 cmm02b osafimmnd[4704]: NO NODE STATE-> IMM_NODE_W_AVAILABLE
2014-03-04T14:20:20.022473+00:00 cmm02b osafimmnd[4704]: NO SERVER STATE: IMM_SERVER_SYNC_PENDING --> IMM_SERVER_SYNC_CLIENT
2014-03-04T14:20:26.925158+00:00 cmm02b kernel: TIPC: Resetting link <1.1.32:eth2-1.1.31:eth2>, peer not responding
2014-03-04T14:20:26.925184+00:00 cmm02b kernel: TIPC: Lost link <1.1.32:eth2-1.1.31:eth2> on network plane A
2014-03-04T14:20:26.925191+00:00 cmm02b kernel: TIPC: Lost contact with <1.1.31>
2014-03-04T14:20:27.893115+00:00 cmm02b kernel: TIPC: Resetting link <1.1.32:eth2-1.1.27:bond0>, peer not responding
2014-03-04T14:20:27.893148+00:00 cmm02b kernel: TIPC: Lost link <1.1.32:eth2-1.1.27:bond0> on network plane A
2014-03-04T14:20:27.893154+00:00 cmm02b kernel: TIPC: Lost contact with <1.1.27>
2014-03-04T14:20:32.026411+00:00 cmm02b osafimmnd[4704]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 2144
2014-03-04T14:20:32.026463+00:00 cmm02b osafimmnd[4704]: NO RepositoryInitModeT is SA_IMM_INIT_FROM_FILE
2014-03-04T14:20:32.026493+00:00 cmm02b osafimmnd[4704]: NO Epoch set to 22 in ImmModel
2014-03-04T14:20:32.031737+00:00 cmm02b osafimmnd[4704]: NO Implementer connected: 72 (MsgQueueService73743) <67, 1200f>
2014-03-04T14:20:32.035966+00:00 cmm02b osafimmnd[4704]: NO Implementer connected: 73 (MsgQueueService73231) <0, 11e0f>
2014-03-04T14:20:32.041233+00:00 cmm02b osafimmnd[4704]: NO SERVER STATE: IMM_SERVER_SYNC_CLIENT --> IMM SERVER READY
2014-03-04T14:20:32.042213+00:00 cmm02b osafimmnd[4704]: NO Implementer connected: 74 (MsgQueueService72719) <0, 11c0f>
2014-03-04T14:20:32.047252+00:00 cmm02b osafimmnd[4704]: NO Implementer connected: 75 (MsgQueueService71439) <0, 1170f>
2014-03-04T14:20:46.911220+00:00 cmm02b osafimmnd[4704]: NO Implementer connected: 76 (MsgQueueService73487) <0, 10f0f>
2014-03-04T14:20:46.920751+00:00 cmm02b osafimmnd[4704]: NO Implementer disconnected 76 <0, 10f0f> (MsgQueueService73487)
2014-03-04T14:20:48.012244+00:00 cmm02b osafimmnd[4704]: NO Implementer connected: 77 (MsgQueueService72463) <0, 10f0f>
2014-03-04T14:20:48.014779+00:00 cmm02b osafimmnd[4704]: NO Implementer disconnected 77 <0, 10f0f> (MsgQueueService72463)
2014-03-04T14:21:13.653100+00:00 cmm02b kernel: TIPC: Established link <1.1.32:eth2-1.1.31:eth2> on network plane A
2014-03-04T14:21:14.052200+00:00 cmm02b osafimmnd[4704]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
2014-03-04T14:21:20.913433+00:00 cmm02b osafimmnd[4704]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 14277
2014-03-04T14:21:20.913963+00:00 cmm02b osafimmnd[4704]: NO Epoch set to 23 in ImmModel
2014-03-04T14:21:21.419157+00:00 cmm02b osafimmnd[4704]: NO Implementer connected: 78 (MsgQueueService73487) <0, 11f0f>
2014-03-04T14:21:40.874192+00:00 cmm02b kernel: TIPC: Established link <1.1.32:eth2-1.1.27:bond0> on network plane A
2014-03-04T14:21:42.179625+00:00 cmm02b osafimmnd[4704]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
2014-03-04T14:21:46.871328+00:00 cmm02b osafimmnd[4704]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 14277
2014-03-04T14:21:46.871755+00:00 cmm02b osafimmnd[4704]: NO Epoch set to 24 in ImmModel
2014-03-04T14:21:47.649858+00:00 cmm02b osafimmnd[4704]: NO Implementer connected: 79 (MsgQueueService72463) <0, 11b0f>

_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
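
To check how quickly each TIPC link came back on a node, the "Lost link" / "Established link" pairs in the syslog excerpts above can be diffed with a short script. This is only a sketch, assuming the syslog line format shown in this thread; the names LINE_RE, SAMPLE and link_outages are made up for illustration and are not part of OpenSAF or TIPC.

```python
import re
from datetime import datetime

# Matches kernel syslog lines such as:
# 2014-03-04T14:20:18.502012+00:00 cmm02b kernel: TIPC: Lost link <1.1.32:eth2-1.1.15:eth2> on network plane A
# (format assumed from the excerpts in this thread)
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) kernel: TIPC: "
    r"(?P<event>Lost|Established) link <(?P<link>[^>]+)>"
)

# Two lines copied from the cmm02b excerpt, used as sample input.
SAMPLE = [
    "2014-03-04T14:20:18.502012+00:00 cmm02b kernel: TIPC: Lost link "
    "<1.1.32:eth2-1.1.15:eth2> on network plane A",
    "2014-03-04T14:20:18.954740+00:00 cmm02b kernel: TIPC: Established link "
    "<1.1.32:eth2-1.1.15:eth2> on network plane A",
]


def link_outages(lines):
    """Return (link, seconds_down) for each Lost -> Established pair, in log order."""
    lost_at = {}   # link name -> timestamp of the most recent "Lost link"
    outages = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a TIPC link up/down line
        ts = datetime.fromisoformat(m.group("ts"))
        link = m.group("link")
        if m.group("event") == "Lost":
            lost_at[link] = ts
        elif link in lost_at:
            # "Established" after a recorded "Lost": report the gap
            outages.append((link, (ts - lost_at.pop(link)).total_seconds()))
    return outages


if __name__ == "__main__":
    for link, seconds in link_outages(SAMPLE):
        print(f"{link}: down for {seconds:.3f} s")
```

Passing an open syslog file instead of SAMPLE gives one entry per link outage, which makes it easy to see which links re-established within the same second and which stayed down longer.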
