Hi Kang-sen,

Before the active controller in slot-5 went down for reboot, were you 
executing any CLM-related operation on the slot-1 payload?
Or were you rebooting the payload in slot-1 while slot-5 was still active?
As I pointed out, in such a case the issue reported in #1120 may cause a 
stale CLM node-left callback to be sent to the slot-1 payload. This issue 
was fixed after the 4.4 release.
On the new active controller, check for CLM-related errors or warnings.
If possible, please share the syslogs from the controllers and the amfd traces.
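
As a rough aid for that check, here is a minimal Python sketch that filters a 
saved syslog for warning and error lines from the CLM and AMF daemons. The 
daemon names osafclmd, osafclmna and osafamfd and the "ER" severity tag are 
assumptions; only osafimmnd, osafamfnd and the "WA"/"NO" tags appear in the 
logs in this thread, so adjust the lists to what your system actually logs.

```python
import re
import sys

# Assumed daemon names; only osafimmnd and osafamfnd appear in this thread.
DAEMONS = ("osafclmd", "osafclmna", "osafamfd", "osafamfnd")
# "WA" appears in the logs in this thread; "ER" is an assumed error tag.
SEVERITIES = ("ER", "WA")

# Matches "<daemon>[<pid>]: <two-letter severity> <message>" inside a syslog line.
LINE_RE = re.compile(
    r"\s(?P<daemon>[\w.-]+)\[(?P<pid>\d+)\]:\s+(?P<sev>[A-Z]{2})\s+(?P<msg>.*)$"
)


def interesting_lines(lines, daemons=DAEMONS, severities=SEVERITIES):
    """Return lines at the given severities that come from one of the
    listed daemons or that mention "clm" anywhere in the message."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # kernel, cron and other non-matching lines
        if match.group("sev") not in severities:
            continue
        from_daemon = match.group("daemon") in daemons
        mentions_clm = "clm" in match.group("msg").lower()
        if from_daemon or mentions_clm:
            hits.append(line.rstrip("\n"))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python filter_syslog.py /path/to/saved/syslog
    with open(sys.argv[1], errors="replace") as handle:
        for hit in interesting_lines(handle):
            print(hit)
```

Run it against the syslog saved from the new active controller and look at 
what it prints for the seconds around 05:33:31.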

Thanks,
Praveen

On 30-Nov-16 7:42 PM, Kang-Sen Lu wrote:
> Hi, Praveen:
>
> I have obtained syslog from another payload, slot-2.
>
> It experienced the same loss of the slot-5 link, but it did not exit the 
> cluster. Is there a way to find out why only slot-1 exited the cluster?
>
> Syslog from slot-2:
>
> Nov 28 05:33:01 BHA-IND-WHF-KK-CAE-2 CRON[64918]: (root) CMD 
> (/usr/share/platform-config/c7000/update-ssh-keys)
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 kernel: [11623672.680359] tipc: 
> Resetting link <1.1.33:fabric1.96-1.1.81:fabric1.96>, peer not responding
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 kernel: [11623672.680366] tipc: Lost 
> link <1.1.33:fabric1.96-1.1.81:fabric1.96> on network plane B
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 kernel: [11623672.688330] tipc: 
> Resetting link <1.1.33:ipcbr0-1.1.81:ipcbr0>, peer not responding
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 kernel: [11623672.688335] tipc: Lost 
> standby link <1.1.33:ipcbr0-1.1.81:ipcbr0> on network plane C
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: WA DISCARD DUPLICATE 
> FEVS message:69286
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: WA Error code 2 
> returned for message type 57 - ignoring
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: WA DISCARD DUPLICATE 
> FEVS message:69287
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: WA Error code 2 
> returned for message type 57 - ignoring
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Global discard node 
> received for nodeId:10501 pid:35221
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> disconnected 126 <0, 10501(down)> (safLogService)
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> disconnected 127 <0, 10501(down)> (safClmService)
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> disconnected 128 <0, 10501(down)> (safAmfService)
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> disconnected 125 <0, 10501(down)> (MsgQueueService66817)
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> disconnected 129 <0, 10501(down)> (safMsgGrpService)
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> disconnected 130 <0, 10501(down)> (safCheckPointService)
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> disconnected 132 <0, 10501(down)> (safLckService)
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> disconnected 131 <0, 10501(down)> (safEvtService)
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> disconnected 134 <0, 10501(down)> (safSmfService)
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> connected: 145 (safClmService) <0, 10a01>
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> connected: 146 (safLogService) <0, 10a01>
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> disconnected 135 <0, 10a01> (@safAmfService10a01)
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 kernel: [11623673.000049] tipc: 
> Resetting link <1.1.33:fabric0.96-1.1.81:fabric0.96>, peer not responding
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 kernel: [11623673.000054] tipc: Lost 
> link <1.1.33:fabric0.96-1.1.81:fabric0.96> on network plane A
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 kernel: [11623673.043989] tipc: 
> Resetting link <1.1.33:ipcbr1-1.1.81:ipcbr1>, peer not responding
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 kernel: [11623673.044005] tipc: Lost 
> link <1.1.33:ipcbr1-1.1.81:ipcbr1> on network plane D
> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-2 kernel: [11623673.044007] tipc: Lost 
> contact with <1.1.81>
> Nov 28 05:33:35 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> connected: 147 (safAmfService) <0, 10a01>
> Nov 28 05:33:35 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> connected: 148 (safMsgGrpService) <0, 10a01>
> Nov 28 05:33:35 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> connected: 149 (safCheckPointService) <0, 10a01>
> Nov 28 05:33:35 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> connected: 150 (safEvtService) <0, 10a01>
> Nov 28 05:33:35 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> connected: 151 (safLckService) <0, 10a01>
> Nov 28 05:33:35 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> connected: 152 (safSmfService) <0, 10a01>
> Nov 28 05:33:35 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> connected: 153 (MsgQueueService66817) <0, 10a01>
> Nov 28 05:33:35 BHA-IND-WHF-KK-CAE-2 osafimmnd[23047]: NO Implementer 
> disconnected 153 <0, 10a01> (MsgQueueService66817)
> Nov 28 05:33:40 BHA-IND-WHF-KK-CAE-2 ntpd[38756]: ntpd exiting on signal 15
> Nov 28 05:33:40 BHA-IND-WHF-KK-CAE-2 kernel: [11623681.621106] bonding: 
> DAF_LAG: Could not set fabric0.2102 as active slave; either fabric0.2102 is 
> down or the link is down.
> Nov 28 05:33:40 BHA-IND-WHF-KK-CAE-2 kernel: [11623681.621214] bonding: 
> DAF_LAG: Could not set fabric0.2102 as active slave; either fabric0.2102 is 
> down or the link is down.
> Nov 28 05:33:40 BHA-IND-WHF-KK-CAE-2 kernel: [11623681.622769] bonding: 
> DAF_LAG: setting arp_validate to all (3).
> Nov 28 05:33:40 BHA-IND-WHF-KK-CAE-2 kernel: [11623681.624148] bonding: 
> DAF_LAG: Setting MII monitoring interval to 0.
> Nov 28 05:33:40 BHA-IND-WHF-KK-CAE-2 kernel: [11623681.625472] bonding: 
> DAF_LAG: Setting ARP monitoring interval to 2000.
> Nov 28 05:33:40 BHA-IND-WHF-KK-CAE-2 kernel: [11623681.626845] bonding: 
> DAF_LAG: Setting fabric0.2102 as primary slave.
>
> Thanks.
>
> Kang-sen
>
> -----Original Message-----
> From: Kang-Sen Lu [mailto:k...@anovadata.com]
> Sent: Wednesday, November 30, 2016 8:15 AM
> To: praveen malviya <praveen.malv...@oracle.com>; 
> opensaf-users@lists.sourceforge.net
> Subject: Re: [users] question about payload blade recovery
>
> Hi, Praveen:
>
> Thanks for your reply. Now I know why slot-1 shut down.
>
> I am not clear on what you mean by:
>
> Please check syslog on new active controller.
> Are you performing any CLM lock/shutdown operation on the slot-1 node?
>
> I do have the syslog from new active controller saved somewhere. Can you tell 
> me exactly what log text to look for?
>
> My current understanding is that after 5 minutes, slot-5 finished rebooting 
> and became the standby controller properly. Slot-10 took no time to 
> transition from standby to active right after slot-5 rebooted.
>
> But slot-1 never rejoined the cluster until someone explicitly restarted 
> opensaf on slot-1.
>
> Kang-sen
>
> -----Original Message-----
> From: praveen malviya [mailto:praveen.malv...@oracle.com]
> Sent: Wednesday, November 30, 2016 3:44 AM
> To: opensaf-users@lists.sourceforge.net
> Subject: Re: [users] question about payload blade recovery
>
> Hi,
>
> Please see inline with [Praveen]
>
> Thanks,
> Praveen
>
> On 30-Nov-16 1:39 AM, Kang-Sen Lu wrote:
>> We are running opensaf 4.4.0.
>>
>> In our C7000 chassis, we have slot-5 as the active controller, slot-10 as 
>> the standby controller, and slot-1 as a payload blade.
>>
>> Somehow, slot-5 rebooted. Applications on slot-1 were terminated, but were 
>> not restarted automatically as expected.
>>
>> Here is a piece of syslog from slot-1. I hope someone can point out what 
>> happened to opensaf on slot-1 and explain why the applications on 
>> slot-1 were not restarted as expected.
>>
>> ===============
>> Nov 28 05:33:01 BHA-IND-WHF-KK-CAE-1 CRON[3462]: (root) CMD
>> (/usr/share/platform-config/c7000/update-ssh-keys)
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 kernel: [6402453.489596] tipc:
>> Resetting link <1.1.17:fabric1.96-1.1.81:fabric1.96>, peer not responding
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 kernel: [6402453.489602] tipc: Lost
>> link <1.1.17:fabric1.96-1.1.81:fabric1.96> on network plane B
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: WA DISCARD DUPLICATE
>> FEVS message:69286
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: WA Error code 2
>> returned for message type 57 - ignoring
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: WA DISCARD DUPLICATE
>> FEVS message:69287
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: WA Error code 2
>> returned for message type 57 - ignoring
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Global discard node
>> received for nodeId:10501 pid:35221
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Implementer
>> disconnected 126 <0, 10501(down)> (safLogService)
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Implementer
>> disconnected 127 <0, 10501(down)> (safClmService)
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Implementer
>> disconnected 128 <0, 10501(down)> (safAmfService)
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Implementer
>> disconnected 125 <0, 10501(down)> (MsgQueueService66817)
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Implementer
>> disconnected 129 <0, 10501(down)> (safMsgGrpService)
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Implementer
>> disconnected 130 <0, 10501(down)> (safCheckPointService)
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Implementer
>> disconnected 132 <0, 10501(down)> (safLckService)
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Implementer
>> disconnected 131 <0, 10501(down)> (safEvtService)
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Implementer
>> disconnected 134 <0, 10501(down)> (safSmfService)
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Implementer
>> connected: 145 (safClmService) <0, 10a01>
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Implementer
>> connected: 146 (safLogService) <0, 10a01>
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafimmnd[38590]: NO Implementer
>> disconnected 135 <0, 10a01> (@safAmfService10a01)
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 osafamfnd[38632]: NO This node has
>> exited the cluster
> [Praveen] This log means that the CLM node to which this payload is mapped 
> has lost its CLM cluster membership. When a node loses its CLM membership, 
> AMF cannot provide service to applications hosted on that node, and all 
> non-OpenSAF components are terminated.
> The components listed below were terminated as a result.
> They will be re-instantiated when this node becomes a member node again.
>
> Please check syslog on new active controller.
> Are you performing any CLM lock/shutdown operation on the slot-1 node?
> I see a ticket #1120 where CLM sends a stale track callback. This was fixed 
> after the 4.4 GA release.
>
>
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 kernel: [6402453.573512] tipc:
>> Resetting link <1.1.17:fabric0.96-1.1.81:fabric0.96>, peer not responding
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 kernel: [6402453.573518] tipc: Lost
>> link <1.1.17:fabric0.96-1.1.81:fabric0.96> on network plane A
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 netmonT2FW_clean: Cleanup for CompName:
>> safComp=NetMonT2FW_PL-1,safSu=NetMonT2FWSU_PL-1,safSg=NetMonT2FWSG,safApp=NetMonT2FWApp
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 netmonT1Sweeper_clean: Cleanup for
>> CompName:
>> safComp=NetMonT1Sweeper_PL-1,safSu=NetMonT1SweeperSU_PL-1,safSg=NetMonT1SweeperSG,safApp=NetMonT1SweeperApp
>> Nov 28 05:33:31 BHA-IND-WHF-KK-CAE-1 netmonT1FL_clean: Cleanup for
>> CompName:
>> safComp=NetMonT1FL_PL-1,safSu=NetMonT1FLSU_PL-1,safSg=NetMonT1FLSG,safApp=NetMonT1FLApp
>> ==============
>>
>> OpenSAF executed "netmonT1FL_clean" to terminate the netmonT1FL 
>> application; it should then execute "netmonT1FL_inst" to restart that 
>> application.
>>
>> Thanks.
>>
>> Kang-sen
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Opensaf-users mailing list
>> Opensaf-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>>