Re: [users] Payload card reboot due to a short time network break

Anders Widell Mon, 09 Apr 2018 11:32:33 -0700

The only way to be sure if it is appropriate is to test under realisticconditions. I agree that it makes sense to increase it so that it islarger than the TIPC link tolerance. It should be noted that the IMMagent always communicates directly with the IMM node director running onthe same node, and for this communication I don't think the TIPC linktolerance is relevant (you will immediately detect if the IMM nodedirector process goes away). However, the IMM node director may in turnhave to communicate with IMM processes running on other nodes in thecluster in order to fulfill your request, and for that communication theTIPC link tolerance comes into play. If it needs to communicate inseveral hops it may even make sense to have a time-out which is severaltimes the TIPC link tolerance (compare with the default values for thesetime-outs: link tolerance=1.5 seconds and IMMA time-out=10 seconds).


regards,


Anders Widell


On 04/09/2018 10:19 AM, Jianfeng Dong wrote:


Hi Anders,

Now we want to increase TIPC tolerance from current 10 seconds to 12or 15, thus we also need to increase a OpenSAF parameter‘IMMA_SYNCR_TIMEOUT’ from current 12 seconds to a bigger value(20maybe), do you think 20 seconds is proper for the parameter?


Thanks.

Regards,

Jianfeng

*From:*Jianfeng Dong
*Sent:* Tuesday, March 13, 2018 5:38 PM

*To:* Anders Widell <[email protected]>; Mathi N P<[email protected]>

*Cc:* [email protected]

*Subject:* RE: [users] Payload card reboot due to a short time networkbreak


Anders,

As you can see in those logs we had set the TIPC link tolerance to 10seconds, I’m just not sure how long is proper especially for this case.

I think I can take a try at least, to turn TIPC running on theEthernet interfaces instead.

Thanks for your comment for the CLM design idea, I understand itdefinitely would not be easy to make such a change.


Thanks,

Jianfeng

*From:*Anders Widell [mailto:[email protected]]
*Sent:* Monday, March 12, 2018 7:52 PM

*To:* Mathi N P <[email protected]<mailto:[email protected]>>; Jianfeng Dong <[email protected]<mailto:[email protected]>>*Cc:* [email protected]<mailto:[email protected]>*Subject:* Re: [users] Payload card reboot due to a short time networkbreak

We also tried running TIPC on a bonded interface but ended up havingto change it since it never worked well. When you have two redundantEthernet interfaces, TIPC will tolerate failures in one of themseamlessly without losing connectivity. But when you run TIPC on abonded interface it doesn't work, as you can see in your case. I guessthe reason is that you have two separate mechanisms on top of eachother, trying to achieve the same thing. One possible workaround is toincrease the TIPC link tolerance.

When we lose connectivity with a node in the cluster, we are expectingthat it happened because the other node went down (rebooted orpermanently died). We don't expect to re-establish connectivity withthe same node unless it has rebooted in between. It would be possibleto introduce a grace time to allow a node to stay in the CLM clusterfor a while after the connectivity with it has been lost, and allow itto continue as a cluster member if connectivity is re-establishedbefore this grace time has expired. However, this is not so easy andit is much easier to increase the TIPC link tolerance and let TIPChandle this for us.


regards,

Anders Widell

On 03/09/2018 12:42 PM, Mathi N P wrote:

    This is an interesting case (and 'rare' :-))

    2018-02-16T17:56:41.172791+00:00 scm2 osafamfd[3312]: WA Sending
    node reboot order to
    node:safAmfNode=PLD0114,safAmfCluster=myAmfCluster, due to late
    node_up_msg after cluster startup timeout
    2018-02-16T17:56:11 to 2018-02-16T17:56:41 except an error, then
    it got the reboot command from SC and thus it reboot itself.

    Given that the node has not 'instantiated' completely and a reboot
    order can be treated as a 'failed start up', based on the current
    AMF state,

    AMF can make a decision by reading the
    'saamfnodefailfastoninstantiationfailure' (or perhaps
    'saamfnodeautorepair' ) attribute to reboot or not and report a
    node instantantiation failure (back to the rc script and other
    associated events for that state).

    Thanks,

    Mathi.

    On Fri, Mar 9, 2018 at 10:42 AM, Jianfeng Dong <[email protected]
    <mailto:[email protected]>> wrote:

        Thanks Anders, much appreciate.

        And yes, in PLD we run TIPC on a bonded interface which
        comprises two Ethernet interfaces.
        I'm wondering why a bonding interface can't provide similar
        protection like TIPC does, is it because TIPC is more robust
        or something else? I'm not sure if it is right to change the
        low-level design at this time point for our product, I will
        talk with my workmates on this change and find more details in
        TIPC manual.

        Regarding to OpenSAF part, do you guys think is it possible
        that SC do not force rebooting the PLD in this case? After all
        the connection recovered quickly.

        Regards,
        Jianfeng


        -----Original Message-----
        From: Anders Widell [mailto:[email protected]
        <mailto:[email protected]>]
        Sent: Thursday, March 8, 2018 8:38 PM
        To: Jianfeng Dong <[email protected]
        <mailto:[email protected]>>;
        [email protected]
        <mailto:[email protected]>
        Subject: Re: [users] Payload card reboot due to a short time
        network break

        Hi!

        Are you running TIPC on a bonded interface? I wouldn't
        recommend this.
        Instead, you should run TIPC on the raw Ethernet interfaces
        and let TIPC handle the link fail-over in case of a failure in
        one of them. TIPC should be able to do this without ever
        losing the connectivity between the nodes.

        regards,

        Anders Widell


        On 03/08/2018 10:43 AM, Jianfeng Dong wrote:
        > Hi,
        >
        > Several days ago we got a payload card reboot issue in
        customer field, a PLD lost connection with SC for a little
        while(about 10 seconds), then SC forced the PLD to reboot even
        though the PLD was going into “SC Absent mode”.
        >
        > System summary:
        > our product is a system with 2 SC boards and at most 14 PLD
        cards, running OpenSAF 5.1.0 with the feature “SC Absent Mode”
        enabled, and SC connect with PLD via Ethernet and TIPC.
        >
        > Issue course:
        > 1. PLD’s internal network went down for a hardware/driver
        problem, but it recovered quickly in 2 seconds.
        >
        > 2018-02-16T17:55:58.343287+00:00 pld0114 kernel: bonding:
        bond0: link
        > status definitely down for interface eth0, disabling it
        > 2018-02-16T17:56:00.743201+00:00 pld0114 kernel: bonding:
        bond0: link status up for interface eth0, enabling it in 60000 ms.
        >
        > 2. 10 seconds later TIPC still broke even though the network
        got recovered.
        >
        > 2018-02-16T17:56:10.050386+00:00 pld0114 kernel: tipc:
        Resetting link
        > <1.1.14:bond0-1.1.16:eth2>, peer not responding
        > 2018-02-16T17:56:10.050428+00:00 pld0114 kernel: tipc: Lost link
        > <1.1.14:bond0-1.1.16:eth2> on network plane A
        > 2018-02-16T17:56:10.050440+00:00 pld0114 kernel: tipc: Lost
        contact
        > with <1.1.16>
        >
        > 3. SC found the PLD left the cluster.
        >
        > 2018-02-16T17:56:10.050704+00:00 scm2 osafimmd[3095]: NO MDS
        event
        > from svc_id 25 (change:4, dest:296935520731140)
        > 2018-02-16T17:56:10.052770+00:00 scm2 osafclmd[3302]: NO
        Node 69135
        > went down. Not sending track callback for agents on that node
        > 2018-02-16T17:56:10.054411+00:00 scm2 osafimmnd[3106]: NO Global
        > discard node received for nodeId:10e0f pid:3516
        > 2018-02-16T17:56:10.054505+00:00 scm2 osafimmnd[3106]: NO
        Implementer
        > disconnected 15 <0, 10e0f(down)> (MsgQueueService69135)
        > 2018-02-16T17:56:10.055158+00:00 scm2 osafamfd[3312]: NO Node
        > 'PLD0114' left the cluster
        >
        > 4. One more second later, the TIPC link also got recovered.
        >
        > 2018-02-16T17:56:11.054553+00:00 pld0114 kernel: tipc:
        Established
        > link <1.1.14:bond0-1.1.16:eth2> on network plane A
        >
        > 5. However, PLD was still impacted by the network issue and
        was trying to go into ‘SC Absent Mode’.
        >
        > 2018-02-16T17:56:11.057260+00:00 pld0114 osafamfnd[3626]: NO AVD
        > NEW_ACTIVE, adest:1
        > 2018-02-16T17:56:11.057407+00:00 pld0114 osafamfnd[3626]: NO
        Sending
        > node up due to NCSMDS_NEW_ACTIVE
        > 2018-02-16T17:56:11.057684+00:00 pld0114 osafamfnd[3626]: NO
        19 SISU
        > states sent
        > 2018-02-16T17:56:11.057715+00:00 pld0114 osafamfnd[3626]: NO
        22 SU
        > states sent
        > 2018-02-16T17:56:11.057775+00:00 pld0114 osafimmnd[3516]: NO
        Sleep
        > done registering IMMND with MDS
        > 2018-02-16T17:56:11.058243+00:00 pld0114 osafmsgnd[3665]: ER
        > saClmDispatch Failed with error 9
        > 2018-02-16T17:56:11.058283+00:00 pld0114 osafckptnd[3697]:
        NO Bad CLM handle. Reinitializing.
        > 2018-02-16T17:56:11.059054+00:00 pld0114 osafimmnd[3516]: NO
        SUCCESS
        > IN REGISTERING IMMND WITH MDS
        > 2018-02-16T17:56:11.059116+00:00 pld0114 osafimmnd[3516]: NO
        > Re-introduce-me highestProcessed:26209 highestReceived:26209
        > 2018-02-16T17:56:11.059699+00:00 pld0114 osafimmnd[3516]: NO
        IMMD
        > service is UP ... ScAbsenseAllowed?:315360000 introduced?:2
        > 2018-02-16T17:56:11.059932+00:00 pld0114 osafimmnd[3516]: NO
        MDS:
        > mds_register_callback: dest 10e0fb03c0010 already exist
        > 2018-02-16T17:56:11.060297+00:00 pld0114 osafimmnd[3516]: NO
        > Re-introduce-me highestProcessed:26209 highestReceived:26209
        > 2018-02-16T17:56:11.062053+00:00 pld0114 osafamfnd[3626]: NO 25
        > CSICOMP states synced
        > 2018-02-16T17:56:11.062102+00:00 pld0114 osafamfnd[3626]: NO
        28 SU
        > states sent
        > 2018-02-16T17:56:11.064418+00:00 pld0114 osafimmnd[3516]: ER
        > MESSAGE:26438 OUT OF ORDER my highest processed:26209 - exiting
        > 2018-02-16T17:56:11.160121+00:00 pld0114 osafckptnd[3697]:
        NO CLM
        > selection object was updated. (12)
        > 2018-02-16T17:56:11.166764+00:00 pld0114 osafamfnd[3626]: NO
        > saClmDispatch BAD_HANDLE
        > 2018-02-16T17:56:11.167030+00:00 pld0114 osafamfnd[3626]: NO
        > 'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' component restart
        probation
        > timer started (timeout: 60000000000 ns)
        > 2018-02-16T17:56:11.167102+00:00 pld0114 osafamfnd[3626]: NO
        > Restarting a component of
        'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF'
        > (comp restart count: 1)
        > 2018-02-16T17:56:11.167135+00:00 pld0114 osafamfnd[3626]: NO
        'safComp=IMMND,safSu=PLD0114,safSg=NoRed,safApp=OpenSAF'
        faulted due to 'avaDown' : Recovery is 'componentRestart'
        >
        > 6. SC received messages from the PLD, then it forced the PLD
        to reboot(due to the node sync timeout?).
        >
        > 2018-02-16T17:56:11.058121+00:00 scm2 osafimmd[3095]: NO MDS
        event
        > from svc_id 25 (change:3, dest:296935520731140)
        > 2018-02-16T17:56:11.058515+00:00 scm2 osafsmfd[3391]: ER
        > saClmClusterNodeGet failed, rc=SA_AIS_ERR_NOT_EXIST (12)
        > 2018-02-16T17:56:11.059607+00:00 scm2 osafimmd[3095]:
        ncs_sel_obj_ind:
        > write failed - Bad file descriptor
        > 2018-02-16T17:56:11.060307+00:00 scm2 osafimmd[3095]:
        ncs_sel_obj_ind:
        > write failed - Bad file descriptor
        > 2018-02-16T17:56:11.060811+00:00 scm2 osafimmd[3095]: NO
        ACT: New
        > Epoch for IMMND process at node 10e0f old epoch: 0  new epoch:6
        > 2018-02-16T17:56:11.062673+00:00 scm2 osafamfd[3312]: NO Receive
        > message with event type:12, msg_type:31, from node:10e0f,
        msg_id:0
        > 2018-02-16T17:56:11.065743+00:00 scm2 osafsmfd[3391]: WA
        > proc_mds_info: SMFND UP failed
        > 2018-02-16T17:56:11.067053+00:00 scm2 osafamfd[3312]: NO Receive
        > message with event type:13, msg_type:32, from node:10e0f,
        msg_id:0
        > 2018-02-16T17:56:11.073149+00:00 scm2 osafimmd[3095]: NO MDS
        event
        > from svc_id 25 (change:4, dest:296935520731140)
        > 2018-02-16T17:56:11.073751+00:00 scm2 osafimmnd[3106]: NO Global
        > discard node received for nodeId:10e0f pid:3516
        > 2018-02-16T17:56:11.169774+00:00 scm2 osafamfd[3312]: WA
        > avd_msg_sanity_chk: invalid msg id 2, msg type 8, from 10e0f
        should be
        > 1
        > 2018-02-16T17:56:11.170443+00:00 scm2 osafamfd[3312]: WA
        > avd_msg_sanity_chk: invalid msg id 3, msg type 8, from 10e0f
        should be
        > 1
        >
        > 2018-02-16T17:56:21.167793+00:00 scm2 osafamfd[3312]: NO
        NodeSync
        > timeout
        > 2018-02-16T17:56:41.172730+00:00 scm2 osafamfd[3312]: NO
        Received
        > node_up from 10e0f: msg_id 1
        > 2018-02-16T17:56:41.172791+00:00 scm2 osafamfd[3312]: WA
        Sending node
        > reboot order to
        node:safAmfNode=PLD0114,safAmfCluster=myAmfCluster,
        > due to late node_up_msg after cluster startup timeout
        > 2018-02-16T17:56:41.272684+00:00 scm2 osafimmd[3095]: NO MDS
        event
        > from svc_id 25 (change:3, dest:296933047877705)
        > 2018-02-16T17:56:41.478486+00:00 scm2 osafimmd[3095]: NO
        Node 10e0f
        > request sync sync-pid:29026 epoch:0
        > 2018-02-16T17:56:43.714855+00:00 scm2 osafimmnd[3106]: NO
        Announce
        > sync, epoch:7
        > 2018-02-16T17:56:43.714960+00:00 scm2 osafimmnd[3106]: NO SERVER
        > STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
        > 2018-02-16T17:56:43.715406+00:00 scm2 osafimmd[3095]: NO
        Successfully
        > announced sync. New ruling epoch:7
        > 2018-02-16T17:56:43.715498+00:00 scm2 osafimmnd[3106]: NO
        NODE STATE->
        > IMM_NODE_R_AVAILABLE
        > 2018-02-16T17:56:43.826039+00:00 scm2 osafimmloadd: NO Sync
        starting
        > 2018-02-16T17:56:56.278337+00:00 scm2 osafimmd[3095]: NO MDS
        event
        > from svc_id 25 (change:4, dest:296933047877705)
        > 2018-02-16T17:56:56.279314+00:00 scm2 osafamfd[3312]: NO Node
        > 'PLD0114' left the cluster
        > 2018-02-16T17:56:56.283580+00:00 scm2 osafimmnd[3106]: NO Global
        > discard node received for nodeId:10e0f pid:29026
        > 2018-02-16T17:56:58.705379+00:00 scm2 osafimmloadd: IN
        Synced 6851
        > objects in total
        > 2018-02-16T17:56:58.705750+00:00 scm2 osafimmnd[3106]: NO
        NODE STATE->
        > IMM_NODE_FULLY_AVAILABLE 18455
        > 2018-02-16T17:56:58.707065+00:00 scm2 osafimmnd[3106]: NO
        Epoch set to
        > 7 in ImmModel
        > 2018-02-16T17:56:58.707274+00:00 scm2 osafimmd[3095]: NO
        ACT: New
        > Epoch for IMMND process at node 1100f old epoch: 6  new epoch:7
        > 2018-02-16T17:56:58.707833+00:00 scm2 osafimmd[3095]: NO
        ACT: New
        > Epoch for IMMND process at node 1010f old epoch: 6  new epoch:7
        > 2018-02-16T17:56:58.708905+00:00 scm2 osafimmloadd: NO Sync
        ending
        > normally
        > 2018-02-16T17:56:58.802050+00:00 scm2 osafimmnd[3106]: NO SERVER
        > STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
        >
        > 7. On the PLD side, seems nothing happened there from
        2018-02-16T17:56:11 to 2018-02-16T17:56:41 except an error,
        then it got the reboot command from SC and thus it reboot itself.
        >
        > 2018-02-16T17:56:41.172501+00:00 pld0114 osafamfnd[3626]: CR
        saImmOmInitialize failed. Use previous value of nodeName.
        > 2018-02-16T17:56:41.174650+00:00 pld0114 osafamfnd[3626]: NO
        Received reboot order, ordering reboot now!
        > 2018-02-16T17:56:41.174711+00:00 pld0114 osafamfnd[3626]:
        Rebooting
        > OpenSAF NodeId = 69135 EE Name = , Reason: Received reboot
        order,
        > OwnNodeId = 69135, SupervisionTime = 0
        > 2018-02-16T17:56:41.268516+00:00 pld0114 osafimmnd[29026]:
        Started
        > 2018-02-16T17:56:41.276505+00:00 pld0114 osafimmnd[29026]:
        NO IMMD
        > service is UP ... ScAbsenseAllowed?:0 introduced?:0
        > 2018-02-16T17:56:41.277035+00:00 pld0114 osafimmnd[29026]:
        NO SERVER
        > STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
        > 2018-02-16T17:56:41.378265+00:00 pld0114 osafimmnd[29026]:
        NO SERVER
        > STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
        > 2018-02-16T17:56:41.478558+00:00 pld0114 osafimmnd[29026]:
        NO SERVER
        > STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
        > 2018-02-16T17:56:41.479004+00:00 pld0114 osafimmnd[29026]:
        NO NODE
        > STATE-> IMM_NODE_ISOLATED
        >
        >
        > IMO for this case maybe it’s better:
        > 1) on SC side, not to reboot the PLD if it recover quickly
        from a network break issue, like this case.
        > 2) on PLD side, stop the process to going into ‘SC Absent Mode’.
        >
        > But actually I’m not sure if OpenSAF should be able to
        handle a network break like this one, I’m also not sure if it
        is the other problem(the NodeSync timeout?) cause the reboot,
        so any comment would be appreciated, thanks.
        >
        > Regards,
        > Jianfeng
        >
        >
        ----------------------------------------------------------------------

        > -------- Check out the vibrant tech community on one of the
        world's
        > most engaging tech sites, Slashdot.org!
        >
        https://urldefense.proofpoint.com/v2/url?u=http-3A__sdm.link_slashdot&;
        >
        d=DwIFaQ&c=HAkYuh63rsuhr6Scbfh0UjBXeMK-ndb3voDTXcWzoCI&r=506epIt5fsRnO
        >
        GUNV9RMO0jfx4u8vdegcV0mcKujXlI&m=Bh1TOB0VYKVVuw1VvZiesbRRukTfQfZU16pu9
        > zLTqMY&s=7LNHDwh2J8V2mbCutdb1pEm5QUVY3fDnBeS8gJFQ5-g&e=
        > _______________________________________________
        > Opensaf-users mailing list
        > [email protected]
        <mailto:[email protected]>
        >
        https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.sourceforge
        >
        .net_lists_listinfo_opensaf-2Dusers&d=DwIFaQ&c=HAkYuh63rsuhr6Scbfh0UjB
        >
        XeMK-ndb3voDTXcWzoCI&r=506epIt5fsRnOGUNV9RMO0jfx4u8vdegcV0mcKujXlI&m=B
        >
        h1TOB0VYKVVuw1VvZiesbRRukTfQfZU16pu9zLTqMY&s=RlhXdyUnNM1cgr4CBSLsKb7cA
        > heTclsUBlyAcTPFIXk&e=



        
------------------------------------------------------------------------------
        Check out the vibrant tech community on one of the world's most
        engaging tech sites, Slashdot.org! http://sdm.link/slashdot
        
<https://urldefense.proofpoint.com/v2/url?u=http-3A__sdm.link_slashdot&d=DwMFaQ&c=HAkYuh63rsuhr6Scbfh0UjBXeMK-ndb3voDTXcWzoCI&r=506epIt5fsRnOGUNV9RMO0jfx4u8vdegcV0mcKujXlI&m=kF0rk6gUqfdyo6_UInQgQbTnOrVLKlgzDp38n2e0q4E&s=_rgULZeJ_Mz7ki6pTzsUTyKt85k5SXmKZWfQ21uHaJU&e=>
        _______________________________________________
        Opensaf-users mailing list
        [email protected]
        <mailto:[email protected]>
        https://lists.sourceforge.net/lists/listinfo/opensaf-users
        
<https://urldefense.proofpoint.com/v2/url?u=https-3A__lists.sourceforge.net_lists_listinfo_opensaf-2Dusers&d=DwMFaQ&c=HAkYuh63rsuhr6Scbfh0UjBXeMK-ndb3voDTXcWzoCI&r=506epIt5fsRnOGUNV9RMO0jfx4u8vdegcV0mcKujXlI&m=kF0rk6gUqfdyo6_UInQgQbTnOrVLKlgzDp38n2e0q4E&s=cEBIyga4e2rIlZPa1Av7lluryVQu3aq6WNlrGlbv-L4&e=>


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users

Re: [users] Payload card reboot due to a short time network break

Reply via email to