Hi!
Are you running TIPC on a bonded interface? I wouldn't recommend this.
Instead, you should run TIPC on the raw Ethernet interfaces and let TIPC
handle the link fail-over in case one of them fails. TIPC should be able
to do this without ever losing connectivity between the nodes.
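For reference, a minimal sketch of what that could look like with the iproute2 `tipc` tool (the interface names are placeholders, and the node address is taken from the logs below; adjust to your setup):

```shell
# Sketch only: one TIPC bearer per physical NIC instead of one bearer
# on bond0. eth0/eth1 are placeholder interface names.
tipc node set address 1.1.14               # node address, as in the logs
tipc bearer enable media eth device eth0   # first link
tipc bearer enable media eth device eth1   # redundant second link
tipc bearer list                           # verify both bearers are up
```

With two bearers, TIPC runs the links in parallel and keeps the session alive as long as at least one of them is up.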
regards,
Anders Widell
On 03/08/2018 10:43 AM, Jianfeng Dong wrote:
Hi,
Several days ago we got a payload card reboot issue in a customer field
deployment: a PLD lost its connection with the SC for a little while (about 10
seconds), then the SC forced the PLD to reboot even though the PLD was going
into “SC Absent Mode”.
System summary:
Our product is a system with 2 SC boards and at most 14 PLD cards, running
OpenSAF 5.1.0 with the “SC Absent Mode” feature enabled; the SCs connect with
the PLDs via Ethernet and TIPC.
Course of the issue:
1. The PLD’s internal network went down due to a hardware/driver problem, but
it recovered within 2 seconds.
2018-02-16T17:55:58.343287+00:00 pld0114 kernel: bonding: bond0: link status
definitely down for interface eth0, disabling it
2018-02-16T17:56:00.743201+00:00 pld0114 kernel: bonding: bond0: link status up
for interface eth0, enabling it in 60000 ms.
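A side observation on the log above: the "enabling it in 60000 ms" text is printed by the bonding driver when the `updelay` parameter is set, meaning the slave is held disabled for 60 s after carrier returns (per the kernel bonding documentation, the delay is skipped when the bond has no other active slave). A quick, hypothetical way to confirm the setting:

```shell
# Hypothetical check of the bond's updelay (bond name taken from the logs):
cat /sys/class/net/bond0/bonding/updelay   # the log suggests 60000 (ms)
```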
2. Ten seconds later the TIPC link broke anyway, even though the network had
recovered.
2018-02-16T17:56:10.050386+00:00 pld0114 kernel: tipc: Resetting link
<1.1.14:bond0-1.1.16:eth2>, peer not responding
2018-02-16T17:56:10.050428+00:00 pld0114 kernel: tipc: Lost link
<1.1.14:bond0-1.1.16:eth2> on network plane A
2018-02-16T17:56:10.050440+00:00 pld0114 kernel: tipc: Lost contact with
<1.1.16>
3. The SC noticed that the PLD had left the cluster.
2018-02-16T17:56:10.050704+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id
25 (change:4, dest:296935520731140)
2018-02-16T17:56:10.052770+00:00 scm2 osafclmd[3302]: NO Node 69135 went down.
Not sending track callback for agents on that node
2018-02-16T17:56:10.054411+00:00 scm2 osafimmnd[3106]: NO Global discard node
received for nodeId:10e0f pid:3516
2018-02-16T17:56:10.054505+00:00 scm2 osafimmnd[3106]: NO Implementer disconnected 15
<0, 10e0f(down)> (MsgQueueService69135)
2018-02-16T17:56:10.055158+00:00 scm2 osafamfd[3312]: NO Node 'PLD0114' left
the cluster
4. One second later, the TIPC link recovered as well.
2018-02-16T17:56:11.054553+00:00 pld0114 kernel: tipc: Established link
<1.1.14:bond0-1.1.16:eth2> on network plane A
5. However, the PLD was still impacted by the network issue and was trying to
go into ‘SC Absent Mode’.
2018-02-16T17:56:11.057260+00:00 pld0114 osafamfnd[3626]: NO AVD NEW_ACTIVE,
adest:1
2018-02-16T17:56:11.057407+00:00 pld0114 osafamfnd[3626]: NO Sending node up
due to NCSMDS_NEW_ACTIVE
2018-02-16T17:56:11.057684+00:00 pld0114 osafamfnd[3626]: NO 19 SISU states sent
2018-02-16T17:56:11.057715+00:00 pld0114 osafamfnd[3626]: NO 22 SU states sent
2018-02-16T17:56:11.057775+00:00 pld0114 osafimmnd[3516]: NO Sleep done
registering IMMND with MDS
2018-02-16T17:56:11.058243+00:00 pld0114 osafmsgnd[3665]: ER saClmDispatch
Failed with error 9
2018-02-16T17:56:11.058283+00:00 pld0114 osafckptnd[3697]: NO Bad CLM handle.
Reinitializing.
2018-02-16T17:56:11.059054+00:00 pld0114 osafimmnd[3516]: NO SUCCESS IN
REGISTERING IMMND WITH MDS
2018-02-16T17:56:11.059116+00:00 pld0114 osafimmnd[3516]: NO Re-introduce-me
highestProcessed:26209 highestReceived:26209
2018-02-16T17:56:11.059699+00:00 pld0114 osafimmnd[3516]: NO IMMD service is UP
... ScAbsenseAllowed?:315360000 introduced?:2
2018-02-16T17:56:11.059932+00:00 pld0114 osafimmnd[3516]: NO MDS:
mds_register_callback: dest 10e0fb03c0010 already exist
2018-02-16T17:56:11.060297+00:00 pld0114 osafimmnd[3516]: NO Re-introduce-me
highestProcessed:26209 highestReceived:26209
2018-02-16T17:56:11.062053+00:00 pld0114 osafamfnd[3626]: NO 25 CSICOMP states
synced
2018-02-16T17:56:11.062102+00:00 pld0114 osafamfnd[3626]: NO 28 SU states sent
2018-02-16T17:56:11.064418+00:00 pld0114 osafimmnd[3516]: ER MESSAGE:26438 OUT
OF ORDER my highest processed:26209 - exiting
2018-02-16T17:56:11.160121+00:00 pld0114 osafckptnd[3697]: NO CLM selection
object was updated. (12)
2018-02-16T17:56:11.166764+00:00 pld0114 osafamfnd[3626]: NO saClmDispatch
BAD_HANDLE
2018-02-16T17:56:11.167030+00:00 pld0114 osafamfnd[3626]: NO
'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' component restart probation timer
started (timeout: 60000000000 ns)
2018-02-16T17:56:11.167102+00:00 pld0114 osafamfnd[3626]: NO Restarting a
component of 'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
2018-02-16T17:56:11.167135+00:00 pld0114 osafamfnd[3626]: NO
'safComp=IMMND,safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' faulted due to
'avaDown' : Recovery is 'componentRestart'
6. The SC received messages from the PLD, then forced the PLD to reboot (due
to the node sync timeout?).
2018-02-16T17:56:11.058121+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id
25 (change:3, dest:296935520731140)
2018-02-16T17:56:11.058515+00:00 scm2 osafsmfd[3391]: ER saClmClusterNodeGet
failed, rc=SA_AIS_ERR_NOT_EXIST (12)
2018-02-16T17:56:11.059607+00:00 scm2 osafimmd[3095]: ncs_sel_obj_ind: write
failed - Bad file descriptor
2018-02-16T17:56:11.060307+00:00 scm2 osafimmd[3095]: ncs_sel_obj_ind: write
failed - Bad file descriptor
2018-02-16T17:56:11.060811+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for
IMMND process at node 10e0f old epoch: 0 new epoch:6
2018-02-16T17:56:11.062673+00:00 scm2 osafamfd[3312]: NO Receive message with
event type:12, msg_type:31, from node:10e0f, msg_id:0
2018-02-16T17:56:11.065743+00:00 scm2 osafsmfd[3391]: WA proc_mds_info: SMFND
UP failed
2018-02-16T17:56:11.067053+00:00 scm2 osafamfd[3312]: NO Receive message with
event type:13, msg_type:32, from node:10e0f, msg_id:0
2018-02-16T17:56:11.073149+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id
25 (change:4, dest:296935520731140)
2018-02-16T17:56:11.073751+00:00 scm2 osafimmnd[3106]: NO Global discard node
received for nodeId:10e0f pid:3516
2018-02-16T17:56:11.169774+00:00 scm2 osafamfd[3312]: WA avd_msg_sanity_chk:
invalid msg id 2, msg type 8, from 10e0f should be 1
2018-02-16T17:56:11.170443+00:00 scm2 osafamfd[3312]: WA avd_msg_sanity_chk:
invalid msg id 3, msg type 8, from 10e0f should be 1
2018-02-16T17:56:21.167793+00:00 scm2 osafamfd[3312]: NO NodeSync timeout
2018-02-16T17:56:41.172730+00:00 scm2 osafamfd[3312]: NO Received node_up from
10e0f: msg_id 1
2018-02-16T17:56:41.172791+00:00 scm2 osafamfd[3312]: WA Sending node reboot
order to node:safAmfNode=PLD0114,safAmfCluster=myAmfCluster, due to late
node_up_msg after cluster startup timeout
2018-02-16T17:56:41.272684+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id
25 (change:3, dest:296933047877705)
2018-02-16T17:56:41.478486+00:00 scm2 osafimmd[3095]: NO Node 10e0f request
sync sync-pid:29026 epoch:0
2018-02-16T17:56:43.714855+00:00 scm2 osafimmnd[3106]: NO Announce sync, epoch:7
2018-02-16T17:56:43.714960+00:00 scm2 osafimmnd[3106]: NO SERVER STATE:
IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
2018-02-16T17:56:43.715406+00:00 scm2 osafimmd[3095]: NO Successfully announced
sync. New ruling epoch:7
2018-02-16T17:56:43.715498+00:00 scm2 osafimmnd[3106]: NO NODE STATE->
IMM_NODE_R_AVAILABLE
2018-02-16T17:56:43.826039+00:00 scm2 osafimmloadd: NO Sync starting
2018-02-16T17:56:56.278337+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id
25 (change:4, dest:296933047877705)
2018-02-16T17:56:56.279314+00:00 scm2 osafamfd[3312]: NO Node 'PLD0114' left
the cluster
2018-02-16T17:56:56.283580+00:00 scm2 osafimmnd[3106]: NO Global discard node
received for nodeId:10e0f pid:29026
2018-02-16T17:56:58.705379+00:00 scm2 osafimmloadd: IN Synced 6851 objects in
total
2018-02-16T17:56:58.705750+00:00 scm2 osafimmnd[3106]: NO NODE STATE->
IMM_NODE_FULLY_AVAILABLE 18455
2018-02-16T17:56:58.707065+00:00 scm2 osafimmnd[3106]: NO Epoch set to 7 in
ImmModel
2018-02-16T17:56:58.707274+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for
IMMND process at node 1100f old epoch: 6 new epoch:7
2018-02-16T17:56:58.707833+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for
IMMND process at node 1010f old epoch: 6 new epoch:7
2018-02-16T17:56:58.708905+00:00 scm2 osafimmloadd: NO Sync ending normally
2018-02-16T17:56:58.802050+00:00 scm2 osafimmnd[3106]: NO SERVER STATE:
IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY
7. On the PLD side, nothing seems to have happened from 2018-02-16T17:56:11 to
2018-02-16T17:56:41 except one error; then it received the reboot order from
the SC and rebooted itself.
2018-02-16T17:56:41.172501+00:00 pld0114 osafamfnd[3626]: CR saImmOmInitialize
failed. Use previous value of nodeName.
2018-02-16T17:56:41.174650+00:00 pld0114 osafamfnd[3626]: NO Received reboot
order, ordering reboot now!
2018-02-16T17:56:41.174711+00:00 pld0114 osafamfnd[3626]: Rebooting OpenSAF
NodeId = 69135 EE Name = , Reason: Received reboot order, OwnNodeId = 69135,
SupervisionTime = 0
2018-02-16T17:56:41.268516+00:00 pld0114 osafimmnd[29026]: Started
2018-02-16T17:56:41.276505+00:00 pld0114 osafimmnd[29026]: NO IMMD service is
UP ... ScAbsenseAllowed?:0 introduced?:0
2018-02-16T17:56:41.277035+00:00 pld0114 osafimmnd[29026]: NO SERVER STATE:
IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
2018-02-16T17:56:41.378265+00:00 pld0114 osafimmnd[29026]: NO SERVER STATE:
IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
2018-02-16T17:56:41.478558+00:00 pld0114 osafimmnd[29026]: NO SERVER STATE:
IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
2018-02-16T17:56:41.479004+00:00 pld0114 osafimmnd[29026]: NO NODE STATE->
IMM_NODE_ISOLATED
IMO, for this case it might be better:
1) on the SC side, not to reboot the PLD if it recovers quickly from a network
break, as in this case;
2) on the PLD side, to stop the process from going into ‘SC Absent Mode’.
But actually I’m not sure whether OpenSAF should be able to handle a network
break like this one, and I’m also not sure whether some other problem (the
NodeSync timeout?) caused the reboot, so any comments would be appreciated.
Thanks.
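Regarding the timeout question: the reboot order in step 6 cites a "late node_up_msg after cluster startup timeout". In case it helps the investigation, that value can be inspected with the IMM command-line tools (the object DN is taken from the logs above; please verify the attribute name against your IMM model before relying on it):

```shell
# Hypothetical inspection of the AMF cluster startup timeout:
immlist safAmfCluster=myAmfCluster
# The timeout should be held in saAmfClusterStartupTimeout (nanoseconds);
# if needed, it could be raised with immcfg, e.g.:
# immcfg -a saAmfClusterStartupTimeout=<value-in-ns> safAmfCluster=myAmfCluster
```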
Regards,
Jianfeng
------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Opensaf-users mailing list
Opensaf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-users