This is an interesting case (and 'rare' :-))

2018-02-16T17:56:41.172791+00:00 scm2 osafamfd[3312]: WA Sending node
reboot order to node:safAmfNode=PLD0114,safAmfCluster=myAmfCluster, due to
late node_up_msg after cluster startup timeout

On the PLD side, it seems nothing happened from 2018-02-16T17:56:11 to
2018-02-16T17:56:41 except one error; then it received the reboot order from
the SC and rebooted itself.

Given that the node has not 'instantiated' completely, a reboot order can be
treated as a 'failed start-up'. Based on the current AMF state, AMF could
decide whether or not to reboot by reading the
'saAmfNodeFailfastOnInstantiationFailure' (or perhaps 'saAmfNodeAutoRepair')
attribute, and report a node instantiation failure (back to the rc script,
together with the other events associated with that state).
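
For reference, the current settings of those attributes on the affected node
can be inspected (and, if desired, changed) with the OpenSAF IMM tools; a
minimal sketch, using the node DN from the log above:

  # list all attributes of the AMF node object, including
  # saAmfNodeAutoRepair and saAmfNodeFailfastOnInstantiationFailure
  immlist safAmfNode=PLD0114,safAmfCluster=myAmfCluster

  # e.g. turn off automatic repair (reboot) for that node
  immcfg -a saAmfNodeAutoRepair=0 safAmfNode=PLD0114,safAmfCluster=myAmfCluster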

Thanks,
Mathi.



On Fri, Mar 9, 2018 at 10:42 AM, Jianfeng Dong <jd...@juniper.net> wrote:

> Thanks Anders, much appreciated.
>
> And yes, in PLD we run TIPC on a bonded interface which comprises two
> Ethernet interfaces.
> I'm wondering why a bonded interface can't provide protection similar to
> what TIPC does. Is it because TIPC is more robust, or is it something else?
> I'm not sure it is right to change the low-level design at this point for
> our product; I will discuss this change with my colleagues and look for more
> details in the TIPC manual.
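>
> (As a side note, the bond's fail-over parameters on the PLD can be inspected
> through the standard bonding proc interface, a quick check being:
>
>   cat /proc/net/bonding/bond0   # shows mode, miimon, updelay, slave states
>
> the "enabling it in 60000 ms" message in the log below appears to come from
> a 60-second updelay configured on that bond.)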
>
> Regarding the OpenSAF part, do you think it would be possible for the SC not
> to force a reboot of the PLD in a case like this? After all, the connection
> recovered quickly.
>
> Regards,
> Jianfeng
>
> -----Original Message-----
> From: Anders Widell [mailto:anders.wid...@ericsson.com]
> Sent: Thursday, March 8, 2018 8:38 PM
> To: Jianfeng Dong <jd...@juniper.net>; opensaf-users@lists.sourceforge.net
> Subject: Re: [users] Payload card reboot due to a short time network break
>
> Hi!
>
> Are you running TIPC on a bonded interface? I wouldn't recommend this.
> Instead, you should run TIPC on the raw Ethernet interfaces and let TIPC
> handle the link fail-over in case of a failure in one of them. TIPC should
> be able to do this without ever losing the connectivity between the nodes.
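>
> For illustration only (a minimal sketch with assumed interface names, using
> the iproute2-style tipc tool), the two physical ports could be enabled as
> separate TIPC bearers instead of the bond:
>
>   # one TIPC bearer per raw Ethernet interface
>   tipc bearer enable media eth device eth0
>   tipc bearer enable media eth device eth1
>
>   # verify that two parallel links come up towards each peer node
>   tipc link list
>
> With two bearers, TIPC keeps traffic flowing on the remaining link if one
> interface fails, so the nodes never lose connectivity.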
>
> regards,
>
> Anders Widell
>
>
> On 03/08/2018 10:43 AM, Jianfeng Dong wrote:
> > Hi,
> >
> > Several days ago we got a payload card reboot issue at a customer site: a
> > PLD lost its connection with the SC for a short while (about 10 seconds),
> > and the SC then forced the PLD to reboot even though the PLD was going
> > into “SC Absent mode”.
> >
> > System summary:
> > Our product is a system with 2 SC boards and at most 14 PLD cards, running
> > OpenSAF 5.1.0 with the “SC Absent Mode” feature enabled; the SCs connect
> > to the PLDs via Ethernet and TIPC.
> >
> > Issue course:
> > 1. The PLD’s internal network went down due to a hardware/driver problem,
> > but it recovered within about 2 seconds.
> >
> > 2018-02-16T17:55:58.343287+00:00 pld0114 kernel: bonding: bond0: link
> > status definitely down for interface eth0, disabling it
> > 2018-02-16T17:56:00.743201+00:00 pld0114 kernel: bonding: bond0: link
> > status up for interface eth0, enabling it in 60000 ms.
> >
> > 2. About 10 seconds later, TIPC nevertheless reset the link even though
> > the network had recovered.
> >
> > 2018-02-16T17:56:10.050386+00:00 pld0114 kernel: tipc: Resetting link
> > <1.1.14:bond0-1.1.16:eth2>, peer not responding
> > 2018-02-16T17:56:10.050428+00:00 pld0114 kernel: tipc: Lost link
> > <1.1.14:bond0-1.1.16:eth2> on network plane A
> > 2018-02-16T17:56:10.050440+00:00 pld0114 kernel: tipc: Lost contact
> > with <1.1.16>
> >
> > 3. The SC found that the PLD had left the cluster.
> >
> > 2018-02-16T17:56:10.050704+00:00 scm2 osafimmd[3095]: NO MDS event
> > from svc_id 25 (change:4, dest:296935520731140)
> > 2018-02-16T17:56:10.052770+00:00 scm2 osafclmd[3302]: NO Node 69135
> > went down. Not sending track callback for agents on that node
> > 2018-02-16T17:56:10.054411+00:00 scm2 osafimmnd[3106]: NO Global
> > discard node received for nodeId:10e0f pid:3516
> > 2018-02-16T17:56:10.054505+00:00 scm2 osafimmnd[3106]: NO Implementer
> > disconnected 15 <0, 10e0f(down)> (MsgQueueService69135)
> > 2018-02-16T17:56:10.055158+00:00 scm2 osafamfd[3312]: NO Node
> > 'PLD0114' left the cluster
> >
> > 4. One second later, the TIPC link was re-established as well.
> >
> > 2018-02-16T17:56:11.054553+00:00 pld0114 kernel: tipc: Established
> > link <1.1.14:bond0-1.1.16:eth2> on network plane A
> >
> > 5. However, the PLD was still impacted by the network issue and was trying
> > to go into ‘SC Absent Mode’.
> >
> > 2018-02-16T17:56:11.057260+00:00 pld0114 osafamfnd[3626]: NO AVD
> > NEW_ACTIVE, adest:1
> > 2018-02-16T17:56:11.057407+00:00 pld0114 osafamfnd[3626]: NO Sending
> > node up due to NCSMDS_NEW_ACTIVE
> > 2018-02-16T17:56:11.057684+00:00 pld0114 osafamfnd[3626]: NO 19 SISU
> > states sent
> > 2018-02-16T17:56:11.057715+00:00 pld0114 osafamfnd[3626]: NO 22 SU
> > states sent
> > 2018-02-16T17:56:11.057775+00:00 pld0114 osafimmnd[3516]: NO Sleep
> > done registering IMMND with MDS
> > 2018-02-16T17:56:11.058243+00:00 pld0114 osafmsgnd[3665]: ER
> > saClmDispatch Failed with error 9
> > 2018-02-16T17:56:11.058283+00:00 pld0114 osafckptnd[3697]: NO Bad CLM
> > handle. Reinitializing.
> > 2018-02-16T17:56:11.059054+00:00 pld0114 osafimmnd[3516]: NO SUCCESS
> > IN REGISTERING IMMND WITH MDS
> > 2018-02-16T17:56:11.059116+00:00 pld0114 osafimmnd[3516]: NO
> > Re-introduce-me highestProcessed:26209 highestReceived:26209
> > 2018-02-16T17:56:11.059699+00:00 pld0114 osafimmnd[3516]: NO IMMD
> > service is UP ... ScAbsenseAllowed?:315360000 introduced?:2
> > 2018-02-16T17:56:11.059932+00:00 pld0114 osafimmnd[3516]: NO MDS:
> > mds_register_callback: dest 10e0fb03c0010 already exist
> > 2018-02-16T17:56:11.060297+00:00 pld0114 osafimmnd[3516]: NO
> > Re-introduce-me highestProcessed:26209 highestReceived:26209
> > 2018-02-16T17:56:11.062053+00:00 pld0114 osafamfnd[3626]: NO 25
> > CSICOMP states synced
> > 2018-02-16T17:56:11.062102+00:00 pld0114 osafamfnd[3626]: NO 28 SU
> > states sent
> > 2018-02-16T17:56:11.064418+00:00 pld0114 osafimmnd[3516]: ER
> > MESSAGE:26438 OUT OF ORDER my highest processed:26209 - exiting
> > 2018-02-16T17:56:11.160121+00:00 pld0114 osafckptnd[3697]: NO CLM
> > selection object was updated. (12)
> > 2018-02-16T17:56:11.166764+00:00 pld0114 osafamfnd[3626]: NO
> > saClmDispatch BAD_HANDLE
> > 2018-02-16T17:56:11.167030+00:00 pld0114 osafamfnd[3626]: NO
> > 'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' component restart probation
> > timer started (timeout: 60000000000 ns)
> > 2018-02-16T17:56:11.167102+00:00 pld0114 osafamfnd[3626]: NO
> > Restarting a component of 'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF'
> > (comp restart count: 1)
> > 2018-02-16T17:56:11.167135+00:00 pld0114 osafamfnd[3626]: NO
> > 'safComp=IMMND,safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' faulted due to
> > 'avaDown' : Recovery is 'componentRestart'
> >
> > 6. The SC received messages from the PLD, then forced the PLD to reboot
> > (due to the NodeSync timeout?).
> >
> > 2018-02-16T17:56:11.058121+00:00 scm2 osafimmd[3095]: NO MDS event
> > from svc_id 25 (change:3, dest:296935520731140)
> > 2018-02-16T17:56:11.058515+00:00 scm2 osafsmfd[3391]: ER
> > saClmClusterNodeGet failed, rc=SA_AIS_ERR_NOT_EXIST (12)
> > 2018-02-16T17:56:11.059607+00:00 scm2 osafimmd[3095]: ncs_sel_obj_ind:
> > write failed - Bad file descriptor
> > 2018-02-16T17:56:11.060307+00:00 scm2 osafimmd[3095]: ncs_sel_obj_ind:
> > write failed - Bad file descriptor
> > 2018-02-16T17:56:11.060811+00:00 scm2 osafimmd[3095]: NO ACT: New
> > Epoch for IMMND process at node 10e0f old epoch: 0  new epoch:6
> > 2018-02-16T17:56:11.062673+00:00 scm2 osafamfd[3312]: NO Receive
> > message with event type:12, msg_type:31, from node:10e0f, msg_id:0
> > 2018-02-16T17:56:11.065743+00:00 scm2 osafsmfd[3391]: WA
> > proc_mds_info: SMFND UP failed
> > 2018-02-16T17:56:11.067053+00:00 scm2 osafamfd[3312]: NO Receive
> > message with event type:13, msg_type:32, from node:10e0f, msg_id:0
> > 2018-02-16T17:56:11.073149+00:00 scm2 osafimmd[3095]: NO MDS event
> > from svc_id 25 (change:4, dest:296935520731140)
> > 2018-02-16T17:56:11.073751+00:00 scm2 osafimmnd[3106]: NO Global
> > discard node received for nodeId:10e0f pid:3516
> > 2018-02-16T17:56:11.169774+00:00 scm2 osafamfd[3312]: WA
> > avd_msg_sanity_chk: invalid msg id 2, msg type 8, from 10e0f should be
> > 1
> > 2018-02-16T17:56:11.170443+00:00 scm2 osafamfd[3312]: WA
> > avd_msg_sanity_chk: invalid msg id 3, msg type 8, from 10e0f should be
> > 1
> >
> > 2018-02-16T17:56:21.167793+00:00 scm2 osafamfd[3312]: NO NodeSync
> > timeout
> > 2018-02-16T17:56:41.172730+00:00 scm2 osafamfd[3312]: NO Received
> > node_up from 10e0f: msg_id 1
> > 2018-02-16T17:56:41.172791+00:00 scm2 osafamfd[3312]: WA Sending node
> > reboot order to node:safAmfNode=PLD0114,safAmfCluster=myAmfCluster,
> > due to late node_up_msg after cluster startup timeout
> > 2018-02-16T17:56:41.272684+00:00 scm2 osafimmd[3095]: NO MDS event
> > from svc_id 25 (change:3, dest:296933047877705)
> > 2018-02-16T17:56:41.478486+00:00 scm2 osafimmd[3095]: NO Node 10e0f
> > request sync sync-pid:29026 epoch:0
> > 2018-02-16T17:56:43.714855+00:00 scm2 osafimmnd[3106]: NO Announce
> > sync, epoch:7
> > 2018-02-16T17:56:43.714960+00:00 scm2 osafimmnd[3106]: NO SERVER
> > STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
> > 2018-02-16T17:56:43.715406+00:00 scm2 osafimmd[3095]: NO Successfully
> > announced sync. New ruling epoch:7
> > 2018-02-16T17:56:43.715498+00:00 scm2 osafimmnd[3106]: NO NODE STATE->
> > IMM_NODE_R_AVAILABLE
> > 2018-02-16T17:56:43.826039+00:00 scm2 osafimmloadd: NO Sync starting
> > 2018-02-16T17:56:56.278337+00:00 scm2 osafimmd[3095]: NO MDS event
> > from svc_id 25 (change:4, dest:296933047877705)
> > 2018-02-16T17:56:56.279314+00:00 scm2 osafamfd[3312]: NO Node
> > 'PLD0114' left the cluster
> > 2018-02-16T17:56:56.283580+00:00 scm2 osafimmnd[3106]: NO Global
> > discard node received for nodeId:10e0f pid:29026
> > 2018-02-16T17:56:58.705379+00:00 scm2 osafimmloadd: IN Synced 6851
> > objects in total
> > 2018-02-16T17:56:58.705750+00:00 scm2 osafimmnd[3106]: NO NODE STATE->
> > IMM_NODE_FULLY_AVAILABLE 18455
> > 2018-02-16T17:56:58.707065+00:00 scm2 osafimmnd[3106]: NO Epoch set to
> > 7 in ImmModel
> > 2018-02-16T17:56:58.707274+00:00 scm2 osafimmd[3095]: NO ACT: New
> > Epoch for IMMND process at node 1100f old epoch: 6  new epoch:7
> > 2018-02-16T17:56:58.707833+00:00 scm2 osafimmd[3095]: NO ACT: New
> > Epoch for IMMND process at node 1010f old epoch: 6  new epoch:7
> > 2018-02-16T17:56:58.708905+00:00 scm2 osafimmloadd: NO Sync ending
> > normally
> > 2018-02-16T17:56:58.802050+00:00 scm2 osafimmnd[3106]: NO SERVER
> > STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
> >
> > 7. On the PLD side, it seems nothing happened from 2018-02-16T17:56:11 to
> > 2018-02-16T17:56:41 except one error; then it received the reboot order
> > from the SC and rebooted itself.
> >
> > 2018-02-16T17:56:41.172501+00:00 pld0114 osafamfnd[3626]: CR
> > saImmOmInitialize failed. Use previous value of nodeName.
> > 2018-02-16T17:56:41.174650+00:00 pld0114 osafamfnd[3626]: NO Received
> > reboot order, ordering reboot now!
> > 2018-02-16T17:56:41.174711+00:00 pld0114 osafamfnd[3626]: Rebooting
> > OpenSAF NodeId = 69135 EE Name = , Reason: Received reboot order,
> > OwnNodeId = 69135, SupervisionTime = 0
> > 2018-02-16T17:56:41.268516+00:00 pld0114 osafimmnd[29026]: Started
> > 2018-02-16T17:56:41.276505+00:00 pld0114 osafimmnd[29026]: NO IMMD
> > service is UP ... ScAbsenseAllowed?:0 introduced?:0
> > 2018-02-16T17:56:41.277035+00:00 pld0114 osafimmnd[29026]: NO SERVER
> > STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
> > 2018-02-16T17:56:41.378265+00:00 pld0114 osafimmnd[29026]: NO SERVER
> > STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
> > 2018-02-16T17:56:41.478558+00:00 pld0114 osafimmnd[29026]: NO SERVER
> > STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
> > 2018-02-16T17:56:41.479004+00:00 pld0114 osafimmnd[29026]: NO NODE
> > STATE-> IMM_NODE_ISOLATED
> >
> >
> > IMO, for a case like this it might be better:
> > 1) on the SC side, not to reboot the PLD if it recovers quickly from a
> > network break, as it did here;
> > 2) on the PLD side, to stop the process from going into ‘SC Absent Mode’.
> >
> > But actually I’m not sure whether OpenSAF should be able to handle a
> > network break like this one, and I’m also not sure whether some other
> > problem (the NodeSync timeout?) caused the reboot, so any comments would
> > be appreciated. Thanks.
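> >
> > (For reference, the timeout mentioned in the reboot order above presumably
> > corresponds to the AMF cluster startup timeout, modelled by the
> > saAmfClusterStartupTimeout attribute on the cluster object; a quick,
> > illustrative way to check its configured value on the SC would be:
> >
> >   immlist -a saAmfClusterStartupTimeout safAmfCluster=myAmfCluster
> >
> > the DN is taken from the reboot log above; the value is in nanoseconds.)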
> >
> > Regards,
> > Jianfeng
> >