This is an interesting case (and a 'rare' one :-)).

> 2018-02-16T17:56:41.172791+00:00 scm2 osafamfd[3312]: WA Sending node reboot order to node:safAmfNode=PLD0114,safAmfCluster=myAmfCluster, due to late node_up_msg after cluster startup timeout

> ... from 2018-02-16T17:56:11 to 2018-02-16T17:56:41 except an error, then it got the reboot command from SC and thus it rebooted itself.
Given that the node has not completely 'instantiated', a reboot order at this point can be treated as a 'failed start up'. Based on the current AMF state, AMF could decide whether or not to reboot by reading the 'saAmfNodeFailfastOnInstantiationFailure' (or perhaps 'saAmfNodeAutoRepair') attribute, and report a node instantiation failure (back to the rc script, along with the other events associated with that state).
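These attributes live on the safAmfNode object in IMM, so on a running system they can be inspected with immlist. Just as a rough illustration (not the actual AMF-internal logic; error handling is mostly omitted and the DN is simply taken from this case), a management process could read them through the IMM OM accessor API roughly like this:

#include <stdio.h>
#include <string.h>
#include <saImmOm.h>

/* Sketch only: read two node-level AMF attributes via the IMM OM accessor.
 * Build with the OpenSAF headers and link with -lSaImmOm. */
int main(void)
{
    const char *dn = "safAmfNode=PLD0114,safAmfCluster=myAmfCluster";
    SaImmAttrNameT names[] = {"saAmfNodeAutoRepair",
                              "saAmfNodeFailfastOnInstantiationFailure", NULL};
    SaVersionT version = {'A', 2, 11};
    SaImmHandleT om_handle;
    SaImmAccessorHandleT accessor;
    SaImmAttrValuesT_2 **attrs;
    SaNameT object_name;

    if (saImmOmInitialize(&om_handle, NULL, &version) != SA_AIS_OK)
        return 1;
    if (saImmOmAccessorInitialize(om_handle, &accessor) != SA_AIS_OK)
        return 1;

    object_name.length = (SaUint16T)strlen(dn);
    memcpy(object_name.value, dn, object_name.length);

    /* Both attributes are modelled as SA_UINT32_T (SaBoolT values). */
    if (saImmOmAccessorGet_2(accessor, &object_name, names, &attrs) == SA_AIS_OK) {
        for (SaImmAttrValuesT_2 **a = attrs; *a != NULL; a++) {
            if ((*a)->attrValuesNumber > 0 &&
                (*a)->attrValueType == SA_IMM_ATTR_SAUINT32T)
                printf("%s = %u\n", (*a)->attrName,
                       (unsigned)*(SaUint32T *)(*a)->attrValues[0]);
        }
    }

    saImmOmAccessorFinalize(accessor);
    saImmOmFinalize(om_handle);
    return 0;
}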
Thanks, Mathi.

On Fri, Mar 9, 2018 at 10:42 AM, Jianfeng Dong <jd...@juniper.net> wrote:
> Thanks Anders, much appreciated.
>
> And yes, in PLD we run TIPC on a bonded interface which comprises two Ethernet interfaces.
> I'm wondering why a bonding interface can't provide protection similar to what TIPC does; is it because TIPC is more robust, or something else? I'm not sure if it is right to change the low-level design of our product at this point; I will talk with my workmates about this change and find more details in the TIPC manual.
>
> Regarding the OpenSAF part, do you think it is possible for the SC not to force a reboot of the PLD in this case? After all, the connection recovered quickly.
>
> Regards,
> Jianfeng
>
> -----Original Message-----
> From: Anders Widell [mailto:anders.wid...@ericsson.com]
> Sent: Thursday, March 8, 2018 8:38 PM
> To: Jianfeng Dong <jd...@juniper.net>; opensaf-users@lists.sourceforge.net
> Subject: Re: [users] Payload card reboot due to a short time network break
>
> Hi!
>
> Are you running TIPC on a bonded interface? I wouldn't recommend this. Instead, you should run TIPC on the raw Ethernet interfaces and let TIPC handle the link fail-over in case of a failure in one of them. TIPC should be able to do this without ever losing the connectivity between the nodes.
>
> regards,
>
> Anders Widell
>
>
> On 03/08/2018 10:43 AM, Jianfeng Dong wrote:
> > Hi,
> >
> > Several days ago we got a payload card reboot issue in a customer deployment: a PLD lost its connection with the SC for a little while (about 10 seconds), then the SC forced the PLD to reboot even though the PLD was going into "SC Absent mode".
> >
> > System summary:
> > Our product is a system with 2 SC boards and at most 14 PLD cards, running OpenSAF 5.1.0 with the "SC Absent Mode" feature enabled; the SCs connect with the PLDs via Ethernet and TIPC.
> >
> > Course of the issue:
> > 1. The PLD's internal network went down due to a hardware/driver problem, but it recovered quickly, within 2 seconds.
> >
> > 2018-02-16T17:55:58.343287+00:00 pld0114 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
> > 2018-02-16T17:56:00.743201+00:00 pld0114 kernel: bonding: bond0: link status up for interface eth0, enabling it in 60000 ms.
> >
> > 2. 10 seconds later the TIPC link was still broken even though the network had recovered.
> >
> > 2018-02-16T17:56:10.050386+00:00 pld0114 kernel: tipc: Resetting link <1.1.14:bond0-1.1.16:eth2>, peer not responding
> > 2018-02-16T17:56:10.050428+00:00 pld0114 kernel: tipc: Lost link <1.1.14:bond0-1.1.16:eth2> on network plane A
> > 2018-02-16T17:56:10.050440+00:00 pld0114 kernel: tipc: Lost contact with <1.1.16>
> >
> > 3. The SC found that the PLD had left the cluster.
> >
> > 2018-02-16T17:56:10.050704+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:4, dest:296935520731140)
> > 2018-02-16T17:56:10.052770+00:00 scm2 osafclmd[3302]: NO Node 69135 went down. Not sending track callback for agents on that node
> > 2018-02-16T17:56:10.054411+00:00 scm2 osafimmnd[3106]: NO Global discard node received for nodeId:10e0f pid:3516
> > 2018-02-16T17:56:10.054505+00:00 scm2 osafimmnd[3106]: NO Implementer disconnected 15 <0, 10e0f(down)> (MsgQueueService69135)
> > 2018-02-16T17:56:10.055158+00:00 scm2 osafamfd[3312]: NO Node 'PLD0114' left the cluster
> >
> > 4. One second later, the TIPC link also recovered.
> >
> > 2018-02-16T17:56:11.054553+00:00 pld0114 kernel: tipc: Established link <1.1.14:bond0-1.1.16:eth2> on network plane A
> >
> > 5. However, the PLD was still impacted by the network issue and was trying to go into "SC Absent Mode".
> >
> > 2018-02-16T17:56:11.057260+00:00 pld0114 osafamfnd[3626]: NO AVD NEW_ACTIVE, adest:1
> > 2018-02-16T17:56:11.057407+00:00 pld0114 osafamfnd[3626]: NO Sending node up due to NCSMDS_NEW_ACTIVE
> > 2018-02-16T17:56:11.057684+00:00 pld0114 osafamfnd[3626]: NO 19 SISU states sent
> > 2018-02-16T17:56:11.057715+00:00 pld0114 osafamfnd[3626]: NO 22 SU states sent
> > 2018-02-16T17:56:11.057775+00:00 pld0114 osafimmnd[3516]: NO Sleep done registering IMMND with MDS
> > 2018-02-16T17:56:11.058243+00:00 pld0114 osafmsgnd[3665]: ER saClmDispatch Failed with error 9
> > 2018-02-16T17:56:11.058283+00:00 pld0114 osafckptnd[3697]: NO Bad CLM handle. Reinitializing.
> > 2018-02-16T17:56:11.059054+00:00 pld0114 osafimmnd[3516]: NO SUCCESS IN REGISTERING IMMND WITH MDS
> > 2018-02-16T17:56:11.059116+00:00 pld0114 osafimmnd[3516]: NO Re-introduce-me highestProcessed:26209 highestReceived:26209
> > 2018-02-16T17:56:11.059699+00:00 pld0114 osafimmnd[3516]: NO IMMD service is UP ... ScAbsenseAllowed?:315360000 introduced?:2
> > 2018-02-16T17:56:11.059932+00:00 pld0114 osafimmnd[3516]: NO MDS: mds_register_callback: dest 10e0fb03c0010 already exist
> > 2018-02-16T17:56:11.060297+00:00 pld0114 osafimmnd[3516]: NO Re-introduce-me highestProcessed:26209 highestReceived:26209
> > 2018-02-16T17:56:11.062053+00:00 pld0114 osafamfnd[3626]: NO 25 CSICOMP states synced
> > 2018-02-16T17:56:11.062102+00:00 pld0114 osafamfnd[3626]: NO 28 SU states sent
> > 2018-02-16T17:56:11.064418+00:00 pld0114 osafimmnd[3516]: ER MESSAGE:26438 OUT OF ORDER my highest processed:26209 - exiting
> > 2018-02-16T17:56:11.160121+00:00 pld0114 osafckptnd[3697]: NO CLM selection object was updated. (12)
> > 2018-02-16T17:56:11.166764+00:00 pld0114 osafamfnd[3626]: NO saClmDispatch BAD_HANDLE
> > 2018-02-16T17:56:11.167030+00:00 pld0114 osafamfnd[3626]: NO 'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' component restart probation timer started (timeout: 60000000000 ns)
> > 2018-02-16T17:56:11.167102+00:00 pld0114 osafamfnd[3626]: NO Restarting a component of 'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
> > 2018-02-16T17:56:11.167135+00:00 pld0114 osafamfnd[3626]: NO 'safComp=IMMND,safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'
> >
> > 6. The SC received messages from the PLD, then it forced the PLD to reboot (due to the node sync timeout?).
> >
> > 2018-02-16T17:56:11.058121+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:3, dest:296935520731140)
> > 2018-02-16T17:56:11.058515+00:00 scm2 osafsmfd[3391]: ER saClmClusterNodeGet failed, rc=SA_AIS_ERR_NOT_EXIST (12)
> > 2018-02-16T17:56:11.059607+00:00 scm2 osafimmd[3095]: ncs_sel_obj_ind: write failed - Bad file descriptor
> > 2018-02-16T17:56:11.060307+00:00 scm2 osafimmd[3095]: ncs_sel_obj_ind: write failed - Bad file descriptor
> > 2018-02-16T17:56:11.060811+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for IMMND process at node 10e0f old epoch: 0 new epoch:6
> > 2018-02-16T17:56:11.062673+00:00 scm2 osafamfd[3312]: NO Receive message with event type:12, msg_type:31, from node:10e0f, msg_id:0
> > 2018-02-16T17:56:11.065743+00:00 scm2 osafsmfd[3391]: WA proc_mds_info: SMFND UP failed
> > 2018-02-16T17:56:11.067053+00:00 scm2 osafamfd[3312]: NO Receive message with event type:13, msg_type:32, from node:10e0f, msg_id:0
> > 2018-02-16T17:56:11.073149+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:4, dest:296935520731140)
> > 2018-02-16T17:56:11.073751+00:00 scm2 osafimmnd[3106]: NO Global discard node received for nodeId:10e0f pid:3516
> > 2018-02-16T17:56:11.169774+00:00 scm2 osafamfd[3312]: WA avd_msg_sanity_chk: invalid msg id 2, msg type 8, from 10e0f should be 1
> > 2018-02-16T17:56:11.170443+00:00 scm2 osafamfd[3312]: WA avd_msg_sanity_chk: invalid msg id 3, msg type 8, from 10e0f should be 1
> >
> > 2018-02-16T17:56:21.167793+00:00 scm2 osafamfd[3312]: NO NodeSync timeout
> > 2018-02-16T17:56:41.172730+00:00 scm2 osafamfd[3312]: NO Received node_up from 10e0f: msg_id 1
> > 2018-02-16T17:56:41.172791+00:00 scm2 osafamfd[3312]: WA Sending node reboot order to node:safAmfNode=PLD0114,safAmfCluster=myAmfCluster, due to late node_up_msg after cluster startup timeout
> > 2018-02-16T17:56:41.272684+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:3, dest:296933047877705)
> > 2018-02-16T17:56:41.478486+00:00 scm2 osafimmd[3095]: NO Node 10e0f request sync sync-pid:29026 epoch:0
> > 2018-02-16T17:56:43.714855+00:00 scm2 osafimmnd[3106]: NO Announce sync, epoch:7
> > 2018-02-16T17:56:43.714960+00:00 scm2 osafimmnd[3106]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
> > 2018-02-16T17:56:43.715406+00:00 scm2 osafimmd[3095]: NO Successfully announced sync. New ruling epoch:7
> > 2018-02-16T17:56:43.715498+00:00 scm2 osafimmnd[3106]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
> > 2018-02-16T17:56:43.826039+00:00 scm2 osafimmloadd: NO Sync starting
> > 2018-02-16T17:56:56.278337+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:4, dest:296933047877705)
> > 2018-02-16T17:56:56.279314+00:00 scm2 osafamfd[3312]: NO Node 'PLD0114' left the cluster
> > 2018-02-16T17:56:56.283580+00:00 scm2 osafimmnd[3106]: NO Global discard node received for nodeId:10e0f pid:29026
> > 2018-02-16T17:56:58.705379+00:00 scm2 osafimmloadd: IN Synced 6851 objects in total
> > 2018-02-16T17:56:58.705750+00:00 scm2 osafimmnd[3106]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 18455
> > 2018-02-16T17:56:58.707065+00:00 scm2 osafimmnd[3106]: NO Epoch set to 7 in ImmModel
> > 2018-02-16T17:56:58.707274+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for IMMND process at node 1100f old epoch: 6 new epoch:7
> > 2018-02-16T17:56:58.707833+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for IMMND process at node 1010f old epoch: 6 new epoch:7
> > 2018-02-16T17:56:58.708905+00:00 scm2 osafimmloadd: NO Sync ending normally
> > 2018-02-16T17:56:58.802050+00:00 scm2 osafimmnd[3106]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
> >
> > 7. On the PLD side, it seems nothing happened from 2018-02-16T17:56:11 to 2018-02-16T17:56:41 except an error; then it got the reboot command from the SC and rebooted itself.
> >
> > 2018-02-16T17:56:41.172501+00:00 pld0114 osafamfnd[3626]: CR saImmOmInitialize failed. Use previous value of nodeName.
> > 2018-02-16T17:56:41.174650+00:00 pld0114 osafamfnd[3626]: NO Received reboot order, ordering reboot now!
> > 2018-02-16T17:56:41.174711+00:00 pld0114 osafamfnd[3626]: Rebooting OpenSAF NodeId = 69135 EE Name = , Reason: Received reboot order, OwnNodeId = 69135, SupervisionTime = 0
> > 2018-02-16T17:56:41.268516+00:00 pld0114 osafimmnd[29026]: Started
> > 2018-02-16T17:56:41.276505+00:00 pld0114 osafimmnd[29026]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
> > 2018-02-16T17:56:41.277035+00:00 pld0114 osafimmnd[29026]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
> > 2018-02-16T17:56:41.378265+00:00 pld0114 osafimmnd[29026]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
> > 2018-02-16T17:56:41.478558+00:00 pld0114 osafimmnd[29026]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
> > 2018-02-16T17:56:41.479004+00:00 pld0114 osafimmnd[29026]: NO NODE STATE-> IMM_NODE_ISOLATED
> >
> > IMO, for this case it might be better:
> > 1) on the SC side, not to reboot the PLD if it recovers quickly from a network break, as in this case;
> > 2) on the PLD side, to stop the process of going into "SC Absent Mode".
> >
> > But actually I'm not sure whether OpenSAF should be able to handle a network break like this one, and I'm also not sure whether it is another problem (the NodeSync timeout?) that caused the reboot, so any comments would be appreciated. Thanks.
> >
> > Regards,
> > Jianfeng
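A side note on the "saClmDispatch Failed with error 9" and "Bad CLM handle. Reinitializing." lines in step 5: error 9 is SA_AIS_ERR_BAD_HANDLE, and the agents recover from it by finalizing the CLM handle and initializing a new one, exactly as those messages say. A rough sketch of that standard pattern is below; it is illustrative only (callbacks and most error handling are omitted) and not the actual osafckptnd/osafamfnd code.

#include <poll.h>
#include <stdio.h>
#include <unistd.h>
#include <saClm.h>

static SaClmHandleT clm_handle;
static SaSelectionObjectT clm_fd;

/* (Re-)initialize the CLM library and fetch a fresh selection object.
 * Callbacks are left out to keep the sketch short. Link with -lSaClm. */
static void clm_init(void)
{
    SaVersionT version = {'B', 1, 1};

    while (saClmInitialize(&clm_handle, NULL, &version) != SA_AIS_OK)
        sleep(1);
    saClmSelectionObjectGet(clm_handle, &clm_fd);
}

int main(void)
{
    clm_init();

    for (;;) {
        struct pollfd fds = { .fd = (int)clm_fd, .events = POLLIN };

        if (poll(&fds, 1, -1) <= 0)
            continue;

        if (saClmDispatch(clm_handle, SA_DISPATCH_ALL) == SA_AIS_ERR_BAD_HANDLE) {
            /* The CLM server side went away: drop the old handle and
             * start over, as in "Bad CLM handle. Reinitializing." */
            fprintf(stderr, "Bad CLM handle. Reinitializing.\n");
            saClmFinalize(clm_handle);
            clm_init();
        }
    }
}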