Hi,

Several days ago we hit a payload card reboot issue at a customer site: a PLD lost its connection to the SC for a short while (about 10 seconds), and the SC then forced the PLD to reboot even though the PLD was going into “SC Absent mode”.
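(For reference, SC Absent mode is enabled on our system through the scAbsenceAllowed attribute of the IMM service object, roughly as below; the attribute and object names are the standard ones from the OpenSAF IMM PR documentation as far as I know, and the value matches the “ScAbsenseAllowed?:315360000” that IMMND prints in the logs further down.)

    # enable SC absence handling; the value is in seconds (315360000 is our site-specific choice)
    immcfg -a scAbsenceAllowed=315360000 opensafImm=opensafImm,safApp=safImmService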
System summary: our product is a system with 2 SC boards and up to 14 PLD cards, running OpenSAF 5.1.0 with the “SC Absent Mode” feature enabled; the SCs connect to the PLDs via Ethernet and TIPC.

Issue timeline:

1. The PLD’s internal network went down because of a hardware/driver problem, but it recovered within about 2 seconds.

2018-02-16T17:55:58.343287+00:00 pld0114 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
2018-02-16T17:56:00.743201+00:00 pld0114 kernel: bonding: bond0: link status up for interface eth0, enabling it in 60000 ms.

2. About 10 seconds later, TIPC still reset the link, even though the network had already recovered (more on the TIPC link tolerance in the P.S. at the end).

2018-02-16T17:56:10.050386+00:00 pld0114 kernel: tipc: Resetting link <1.1.14:bond0-1.1.16:eth2>, peer not responding
2018-02-16T17:56:10.050428+00:00 pld0114 kernel: tipc: Lost link <1.1.14:bond0-1.1.16:eth2> on network plane A
2018-02-16T17:56:10.050440+00:00 pld0114 kernel: tipc: Lost contact with <1.1.16>

3. The SC detected that the PLD had left the cluster.

2018-02-16T17:56:10.050704+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:4, dest:296935520731140)
2018-02-16T17:56:10.052770+00:00 scm2 osafclmd[3302]: NO Node 69135 went down. Not sending track callback for agents on that node
2018-02-16T17:56:10.054411+00:00 scm2 osafimmnd[3106]: NO Global discard node received for nodeId:10e0f pid:3516
2018-02-16T17:56:10.054505+00:00 scm2 osafimmnd[3106]: NO Implementer disconnected 15 <0, 10e0f(down)> (MsgQueueService69135)
2018-02-16T17:56:10.055158+00:00 scm2 osafamfd[3312]: NO Node 'PLD0114' left the cluster

4. About one second later, the TIPC link was re-established.

2018-02-16T17:56:11.054553+00:00 pld0114 kernel: tipc: Established link <1.1.14:bond0-1.1.16:eth2> on network plane A

5. However, the PLD was still affected by the network break and was trying to go into “SC Absent Mode”.

2018-02-16T17:56:11.057260+00:00 pld0114 osafamfnd[3626]: NO AVD NEW_ACTIVE, adest:1
2018-02-16T17:56:11.057407+00:00 pld0114 osafamfnd[3626]: NO Sending node up due to NCSMDS_NEW_ACTIVE
2018-02-16T17:56:11.057684+00:00 pld0114 osafamfnd[3626]: NO 19 SISU states sent
2018-02-16T17:56:11.057715+00:00 pld0114 osafamfnd[3626]: NO 22 SU states sent
2018-02-16T17:56:11.057775+00:00 pld0114 osafimmnd[3516]: NO Sleep done registering IMMND with MDS
2018-02-16T17:56:11.058243+00:00 pld0114 osafmsgnd[3665]: ER saClmDispatch Failed with error 9
2018-02-16T17:56:11.058283+00:00 pld0114 osafckptnd[3697]: NO Bad CLM handle. Reinitializing.
2018-02-16T17:56:11.059054+00:00 pld0114 osafimmnd[3516]: NO SUCCESS IN REGISTERING IMMND WITH MDS
2018-02-16T17:56:11.059116+00:00 pld0114 osafimmnd[3516]: NO Re-introduce-me highestProcessed:26209 highestReceived:26209
2018-02-16T17:56:11.059699+00:00 pld0114 osafimmnd[3516]: NO IMMD service is UP ... ScAbsenseAllowed?:315360000 introduced?:2
2018-02-16T17:56:11.059932+00:00 pld0114 osafimmnd[3516]: NO MDS: mds_register_callback: dest 10e0fb03c0010 already exist
2018-02-16T17:56:11.060297+00:00 pld0114 osafimmnd[3516]: NO Re-introduce-me highestProcessed:26209 highestReceived:26209
2018-02-16T17:56:11.062053+00:00 pld0114 osafamfnd[3626]: NO 25 CSICOMP states synced
2018-02-16T17:56:11.062102+00:00 pld0114 osafamfnd[3626]: NO 28 SU states sent
2018-02-16T17:56:11.064418+00:00 pld0114 osafimmnd[3516]: ER MESSAGE:26438 OUT OF ORDER my highest processed:26209 - exiting
2018-02-16T17:56:11.160121+00:00 pld0114 osafckptnd[3697]: NO CLM selection object was updated.
2018-02-16T17:56:11.166764+00:00 pld0114 osafamfnd[3626]: NO saClmDispatch BAD_HANDLE
2018-02-16T17:56:11.167030+00:00 pld0114 osafamfnd[3626]: NO 'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' component restart probation timer started (timeout: 60000000000 ns)
2018-02-16T17:56:11.167102+00:00 pld0114 osafamfnd[3626]: NO Restarting a component of 'safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' (comp restart count: 1)
2018-02-16T17:56:11.167135+00:00 pld0114 osafamfnd[3626]: NO 'safComp=IMMND,safSu=PLD0114,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' : Recovery is 'componentRestart'

6. The SC received messages from the PLD, and then forced the PLD to reboot (due to the NodeSync timeout?).

2018-02-16T17:56:11.058121+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:3, dest:296935520731140)
2018-02-16T17:56:11.058515+00:00 scm2 osafsmfd[3391]: ER saClmClusterNodeGet failed, rc=SA_AIS_ERR_NOT_EXIST (12)
2018-02-16T17:56:11.059607+00:00 scm2 osafimmd[3095]: ncs_sel_obj_ind: write failed - Bad file descriptor
2018-02-16T17:56:11.060307+00:00 scm2 osafimmd[3095]: ncs_sel_obj_ind: write failed - Bad file descriptor
2018-02-16T17:56:11.060811+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for IMMND process at node 10e0f old epoch: 0 new epoch:6
2018-02-16T17:56:11.062673+00:00 scm2 osafamfd[3312]: NO Receive message with event type:12, msg_type:31, from node:10e0f, msg_id:0
2018-02-16T17:56:11.065743+00:00 scm2 osafsmfd[3391]: WA proc_mds_info: SMFND UP failed
2018-02-16T17:56:11.067053+00:00 scm2 osafamfd[3312]: NO Receive message with event type:13, msg_type:32, from node:10e0f, msg_id:0
2018-02-16T17:56:11.073149+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:4, dest:296935520731140)
2018-02-16T17:56:11.073751+00:00 scm2 osafimmnd[3106]: NO Global discard node received for nodeId:10e0f pid:3516
2018-02-16T17:56:11.169774+00:00 scm2 osafamfd[3312]: WA avd_msg_sanity_chk: invalid msg id 2, msg type 8, from 10e0f should be 1
2018-02-16T17:56:11.170443+00:00 scm2 osafamfd[3312]: WA avd_msg_sanity_chk: invalid msg id 3, msg type 8, from 10e0f should be 1
2018-02-16T17:56:21.167793+00:00 scm2 osafamfd[3312]: NO NodeSync timeout
2018-02-16T17:56:41.172730+00:00 scm2 osafamfd[3312]: NO Received node_up from 10e0f: msg_id 1
2018-02-16T17:56:41.172791+00:00 scm2 osafamfd[3312]: WA Sending node reboot order to node:safAmfNode=PLD0114,safAmfCluster=myAmfCluster, due to late node_up_msg after cluster startup timeout
2018-02-16T17:56:41.272684+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:3, dest:296933047877705)
2018-02-16T17:56:41.478486+00:00 scm2 osafimmd[3095]: NO Node 10e0f request sync sync-pid:29026 epoch:0
2018-02-16T17:56:43.714855+00:00 scm2 osafimmnd[3106]: NO Announce sync, epoch:7
2018-02-16T17:56:43.714960+00:00 scm2 osafimmnd[3106]: NO SERVER STATE: IMM_SERVER_READY --> IMM_SERVER_SYNC_SERVER
2018-02-16T17:56:43.715406+00:00 scm2 osafimmd[3095]: NO Successfully announced sync. New ruling epoch:7
2018-02-16T17:56:43.715498+00:00 scm2 osafimmnd[3106]: NO NODE STATE-> IMM_NODE_R_AVAILABLE
2018-02-16T17:56:43.826039+00:00 scm2 osafimmloadd: NO Sync starting
2018-02-16T17:56:56.278337+00:00 scm2 osafimmd[3095]: NO MDS event from svc_id 25 (change:4, dest:296933047877705)
2018-02-16T17:56:56.279314+00:00 scm2 osafamfd[3312]: NO Node 'PLD0114' left the cluster
2018-02-16T17:56:56.283580+00:00 scm2 osafimmnd[3106]: NO Global discard node received for nodeId:10e0f pid:29026
2018-02-16T17:56:58.705379+00:00 scm2 osafimmloadd: IN Synced 6851 objects in total
2018-02-16T17:56:58.705750+00:00 scm2 osafimmnd[3106]: NO NODE STATE-> IMM_NODE_FULLY_AVAILABLE 18455
2018-02-16T17:56:58.707065+00:00 scm2 osafimmnd[3106]: NO Epoch set to 7 in ImmModel
2018-02-16T17:56:58.707274+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for IMMND process at node 1100f old epoch: 6 new epoch:7
2018-02-16T17:56:58.707833+00:00 scm2 osafimmd[3095]: NO ACT: New Epoch for IMMND process at node 1010f old epoch: 6 new epoch:7
2018-02-16T17:56:58.708905+00:00 scm2 osafimmloadd: NO Sync ending normally
2018-02-16T17:56:58.802050+00:00 scm2 osafimmnd[3106]: NO SERVER STATE: IMM_SERVER_SYNC_SERVER --> IMM_SERVER_READY

7. On the PLD side, nothing seems to have happened between 2018-02-16T17:56:11 and 2018-02-16T17:56:41 except one error; it then received the reboot order from the SC and rebooted itself.

2018-02-16T17:56:41.172501+00:00 pld0114 osafamfnd[3626]: CR saImmOmInitialize failed. Use previous value of nodeName.
2018-02-16T17:56:41.174650+00:00 pld0114 osafamfnd[3626]: NO Received reboot order, ordering reboot now!
2018-02-16T17:56:41.174711+00:00 pld0114 osafamfnd[3626]: Rebooting OpenSAF NodeId = 69135 EE Name = , Reason: Received reboot order, OwnNodeId = 69135, SupervisionTime = 0
2018-02-16T17:56:41.268516+00:00 pld0114 osafimmnd[29026]: Started
2018-02-16T17:56:41.276505+00:00 pld0114 osafimmnd[29026]: NO IMMD service is UP ... ScAbsenseAllowed?:0 introduced?:0
2018-02-16T17:56:41.277035+00:00 pld0114 osafimmnd[29026]: NO SERVER STATE: IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
2018-02-16T17:56:41.378265+00:00 pld0114 osafimmnd[29026]: NO SERVER STATE: IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
2018-02-16T17:56:41.478558+00:00 pld0114 osafimmnd[29026]: NO SERVER STATE: IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
2018-02-16T17:56:41.479004+00:00 pld0114 osafimmnd[29026]: NO NODE STATE-> IMM_NODE_ISOLATED

IMO, for this case it might be better:
1) on the SC side, not to reboot the PLD when it recovers quickly from a network break, as in this case;
2) on the PLD side, to stop the process of going into ‘SC Absent Mode’.

But I'm not sure whether OpenSAF should be able to handle a network break like this one, and I'm also not sure whether it was actually another problem (the NodeSync timeout? see also the P.S. below) that caused the reboot, so any comments would be appreciated. Thanks.

Regards,
Jianfeng
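P.S. In case it is relevant to the "late node_up_msg after cluster startup timeout" warning in step 6: if I read the AMF information model correctly, the timeout involved should be the saAmfClusterStartupTimeout attribute on the cluster object, which can be inspected (and, I assume, tuned) roughly like this; the 60 s value below is only an illustration, not something we have tried:

    # read the current cluster startup timeout (SaTimeT, i.e. nanoseconds)
    immlist -a saAmfClusterStartupTimeout safAmfCluster=myAmfCluster
    # illustration only: set it to 60 s (60000000000 ns)
    immcfg -a saAmfClusterStartupTimeout=60000000000 safAmfCluster=myAmfCluster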
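Also, regarding the TIPC link reset in step 2: in case it helps the discussion, this is roughly how we would check and, as an experiment, raise the TIPC link tolerance so that a short break like this one does not bring the link down (using the tipc tool from iproute2; older setups may use tipc-config instead, and the link name is simply the one from our kernel log):

    # list TIPC links and their current state on the PLD
    tipc link list
    # experiment only: raise the tolerance of this link to 10 s (value in ms)
    tipc link set tolerance 10000 link 1.1.14:bond0-1.1.16:eth2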