Hi,

Several days ago we hit a payload card reboot issue at a customer site: a PLD lost its connection to the SC for a short while (about 10 seconds), and the SC then forced the PLD to reboot even though the PLD was going into “SC Absent mode”.
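(For reference, SC Absent mode is enabled on our system through the scAbsenceAllowed attribute of the IMM service object, roughly as below; the attribute and object names are the standard ones from the OpenSAF IMM PR documentation as far as I know, and the value matches the “ScAbsenseAllowed?:315360000” that IMMND prints in the logs further down.)

    # enable SC absence handling; the value is in seconds (315360000 is our site-specific choice)
    immcfg -a scAbsenceAllowed=315360000 opensafImm=opensafImm,safApp=safImmService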
System summary: our product is a system with 2 SC boards and up to 14 PLD cards, running OpenSAF 5.1.0 with the “SC Absent Mode” feature enabled; the SCs connect to the PLDs via Ethernet and TIPC.

Issue timeline:

1. The PLD’s internal network went down because of a hardware/driver problem, but it recovered within about 2 seconds.

2018-02-16T17:55:58.343287+00:00 pld0114 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
2018-02-16T17:56:00.743201+00:00 pld0114 kernel: bonding: bond0: link status up for interface eth0, enabling it in 60000 ms.

2. About 10 seconds later, TIPC still reset the link, even though the network had already recovered (more on the TIPC link tolerance in the P.S. at the end).

2018-02-16T17:56:10.050386+00:00 pld0114 kernel: tipc: Resetting link <1.1.14:bond0-1.1.16:eth2>, peer not responding
2018-02-16T17:56:10.050428+00:00 pld0114 kernel: tipc: Lost link <1.1.14:bond0-1.1.16:eth2> on network plane A
2018-02-16T17:56:10.050440+00:00 pld0114 kernel: tipc: Lost contact with <1.1.16>

3. The SC detected that the PLD had left the cluster.

2018-02-16T17:56:10.050704+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:4, dest:296935520731140)
2018-02-16T17:56:10.052770+00:00 scm2 osafclmd[3302]: NO Node 69135 went down. Not sending track callback for agents on that node
2018-02-16T17:56:10.054411+00:00 scm2 osafimmnd[3106]: NO Global discard node received for nodeId:10e0f pid:3516
2018-02-16T17:56:10.054505+00:00 scm2 osafimmnd[3106]: NO Implementer disconnected 15 <0, 10e0f(down)> (MsgQueueService69135)
2018-02-16T17:56:10.055158+00:00 scm2 osafamfd[3312]: NO Node 'PLD0114' left the cluster

4. About one second later, the TIPC link was re-established.

2018-02-16T17:56:11.054553+00:00 pld0114 kernel: tipc: Established link <1.1.14:bond0-1.1.16:eth2> on network plane A

5. However, the PLD was still affected by the network break and was trying to go into “SC Absent Mode”.

2018-02-16T17:56:11.057260+00:00 pld0114 osafamfnd[3626]: NO AVD NEW_ACTIVE, adest:1
2018-02-16T17:56:11.057407+00:00 pld0114 osafamfnd[3626]: NO Sending node up due to NCSMDS_NEW_ACTIVE
2018-02-16T17:56:11.057684+00:00 pld0114 osafamfnd[3626]: NO 19 SISU states sent
2018-02-16T17:56:11.057715+00:00 pld0114 osafamfnd[3626]: NO 22 SU states sent
2018-02-16T17:56:11.057775+00:00 pld0114 osafimmnd[3516]: NO Sleep done registering IMMND with MDS
2018-02-16T17:56:11.058243+00:00 pld0114 osafmsgnd[3665]: ER saClmDispatch Failed with error 9
2018-02-16T17:56:11.058283+00:00 pld0114 osafckptnd[3697]: NO Bad CLM handle. Reinitializing.
2018-02-16T17:56:11.059054+00:00 pld0114 osafimmnd[3516]: NO SUCCESS IN REGISTERING IMMND WITH MDS
2018-02-16T17:56:11.059116+00:00 pld0114 osafimmnd[3516]: NO Re-introduce-me highestProcessed:26209 highestReceived:26209
2018-02-16T17:56:11.059699+00:00 pld0114 osafimmnd[3516]: NO IMMD service is UP ... ScAbsenseAllowed?:315360000 introduced?:2
2018-02-16T17:56:11.059932+00:00 pld0114 osafimmnd[3516]: NO MDS: mds_register_callback: dest 10e0fb03c0010 already exist
2018-02-16T17:56:11.060297+00:00 pld0114 osafimmnd[3516]: NO Re-introduce-me highestProcessed:26209 highestReceived:26209
2018-02-16T17:56:11.062053+00:00 pld0114 osafamfnd[3626]: NO 25 CSICOMP states synced
2018-02-16T17:56:11.062102+00:00 pld0114 osafamfnd[3626]: NO 28 SU states sent
2018-02-16T17:56:11.064418+00:00 pld0114 osafimmnd[3516]: ER MESSAGE:26438 OUT OF ORDER my highest processed:26209 - exiting
2018-02-16T17:56:11.160121+00:00 pld0114 osafckptnd[3697]: NO CLM selection object was updated.
2018-02-16T17:56:11.166764+00:00 pld0114 osafamfnd[3626]: NO saClmDispatch BAD_HANDLE
2018-02-16T17:56:11.167030+00:00 pld0114 osafamfnd[3626]: NO 'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' component restart probation timer started (timeout: 60000000000 ns)
2018-02-16T17:56:11.167102+00:00 pld0114 osafamfnd[3626]: NO Restarting a component of 'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
2018-02-16T17:56:11.167135+00:00 pld0114 osafamfnd[3626]: NO 'safComp=IMMND,safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'

6. The SC received messages from the PLD, and then forced the PLD to reboot (due to the NodeSync timeout?).

2018-02-16T17:56:11.058121+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:3, dest:296935520731140)
2018-02-16T17:56:11.058515+00:00 scm2 osafsmfd[3391]: ER saClmClusterNodeGet failed, rc=SA_AIS_ERR_NOT_EXIST (12)
2018-02-16T17:56:11.059607+00:00 scm2 osafimmd[3095]: ncs_sel_obj_ind: write failed - Bad file descriptor
2018-02-16T17:56:11.060307+00:00 scm2 osafimmd[3095]: ncs_sel_obj_ind: write failed - Bad file descriptor
2018-02-16T17:56:11.060811+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for IMMND process at node 10e0f old epoch: 0 new epoch:6
2018-02-16T17:56:11.062673+00:00 scm2 osafamfd[3312]: NO Receive message with event type:12, msg_type:31, from node:10e0f, msg_id:0
2018-02-16T17:56:11.065743+00:00 scm2 osafsmfd[3391]: WA proc_mds_info: SMFND UP failed
2018-02-16T17:56:11.067053+00:00 scm2 osafamfd[3312]: NO Receive message with event type:13, msg_type:32, from node:10e0f, msg_id:0
2018-02-16T17:56:11.073149+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:4, dest:296935520731140)
2018-02-16T17:56:11.073751+00:00 scm2 osafimmnd[3106]: NO Global discard node received for nodeId:10e0f pid:3516
2018-02-16T17:56:11.169774+00:00 scm2 osafamfd[3312]: WA avd_msg_sanity_chk: invalid msg id 2, msg type 8, from 10e0f should be 1
2018-02-16T17:56:11.170443+00:00 scm2 osafamfd[3312]: WA avd_msg_sanity_chk: invalid msg id 3, msg type 8, from 10e0f should be 1
2018-02-16T17:56:21.167793+00:00 scm2 osafamfd[3312]: NO NodeSync timeout
2018-02-16T17:56:41.172730+00:00 scm2 osafamfd[3312]: NO Received node_up from 10e0f: msg_id 1
2018-02-16T17:56:41.172791+00:00 scm2 osafamfd[3312]: WA Sending node reboot order to node:safAmfNode=PLD0114,safAmfCluster=myAmfCluster, due to late node_up_msg after cluster startup timeout
2018-02-16T17:56:41.272684+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:3, dest:296933047877705)
2018-02-16T17:56:41.478486+00:00 scm2 osafimmd[3095]: NO Node 10e0f request sync sync-pid:29026 epoch:0
2018-02-16T17:56:43.714855+00:00 scm2 osafimmnd[3106]: NO Announce sync, epoch:7
2018-02-16T17:56:43.714960+00:00 scm2 osafimmnd[3106]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
2018-02-16T17:56:43.715406+00:00 scm2 osafimmd[3095]: NO Successfully announced sync. New ruling epoch:7
2018-02-16T17:56:43.715498+00:00 scm2 osafimmnd[3106]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
2018-02-16T17:56:43.826039+00:00 scm2 osafimmloadd: NO Sync starting
2018-02-16T17:56:56.278337+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:4, dest:296933047877705)
2018-02-16T17:56:56.279314+00:00 scm2 osafamfd[3312]: NO Node 'PLD0114' left the cluster
2018-02-16T17:56:56.283580+00:00 scm2 osafimmnd[3106]: NO Global discard node received for nodeId:10e0f pid:29026
2018-02-16T17:56:58.705379+00:00 scm2 osafimmloadd: IN Synced 6851 objects in total
2018-02-16T17:56:58.705750+00:00 scm2 osafimmnd[3106]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 18455
2018-02-16T17:56:58.707065+00:00 scm2 osafimmnd[3106]: NO Epoch set to 7 in ImmModel
2018-02-16T17:56:58.707274+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for IMMND process at node 1100f old epoch: 6 new epoch:7
2018-02-16T17:56:58.707833+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for IMMND process at node 1010f old epoch: 6 new epoch:7
2018-02-16T17:56:58.708905+00:00 scm2 osafimmloadd: NO Sync ending normally
2018-02-16T17:56:58.802050+00:00 scm2 osafimmnd[3106]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY

7. On the PLD side, nothing seems to have happened between 2018-02-16T17:56:11 and 2018-02-16T17:56:41 except one error; it then received the reboot order from the SC and rebooted itself.

2018-02-16T17:56:41.172501+00:00 pld0114 osafamfnd[3626]: CR saImmOmInitialize failed. Use previous value of nodeName.
2018-02-16T17:56:41.174650+00:00 pld0114 osafamfnd[3626]: NO Received reboot order, ordering reboot now!
2018-02-16T17:56:41.174711+00:00 pld0114 osafamfnd[3626]: Rebooting OpenSAF NodeId = 69135 EE Name = , Reason: Received reboot order, OwnNodeId = 69135, SupervisionTime = 0
2018-02-16T17:56:41.268516+00:00 pld0114 osafimmnd[29026]: Started
2018-02-16T17:56:41.276505+00:00 pld0114 osafimmnd[29026]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
2018-02-16T17:56:41.277035+00:00 pld0114 osafimmnd[29026]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
2018-02-16T17:56:41.378265+00:00 pld0114 osafimmnd[29026]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
2018-02-16T17:56:41.478558+00:00 pld0114 osafimmnd[29026]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
2018-02-16T17:56:41.479004+00:00 pld0114 osafimmnd[29026]: NO NODE STATE-> IMM_NODE_ISOLATED

IMO, for this case it might be better:
1) on the SC side, not to reboot the PLD when it recovers quickly from a network break, as in this case;
2) on the PLD side, to stop the process of going into ‘SC Absent Mode’.

But I'm not sure whether OpenSAF should be able to handle a network break like this one, and I'm also not sure whether it was actually another problem (the NodeSync timeout? see also the P.S. below) that caused the reboot, so any comments would be appreciated. Thanks.

Regards,
Jianfeng
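P.S. In case it is relevant to the "late node_up_msg after cluster startup timeout" warning in step 6: if I read the AMF information model correctly, the timeout involved should be the saAmfClusterStartupTimeout attribute on the cluster object, which can be inspected (and, I assume, tuned) roughly like this; the 60 s value below is only an illustration, not something we have tried:

    # read the current cluster startup timeout (SaTimeT, i.e. nanoseconds)
    immlist -a saAmfClusterStartupTimeout safAmfCluster=myAmfCluster
    # illustration only: set it to 60 s (60000000000 ns)
    immcfg -a saAmfClusterStartupTimeout=60000000000 safAmfCluster=myAmfCluster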
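Also, regarding the TIPC link reset in step 2: in case it helps the discussion, this is roughly how we would check and, as an experiment, raise the TIPC link tolerance so that a short break like this one does not bring the link down (using the tipc tool from iproute2; older setups may use tipc-config instead, and the link name is simply the one from our kernel log):

    # list TIPC links and their current state on the PLD
    tipc link list
    # experiment only: raise the tolerance of this link to 10 s (value in ms)
    tipc link set tolerance 10000 link 1.1.14:bond0-1.1.16:eth2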