Well, #721 is not just a case of a "faster" failover; it is a case of an
*incorrect*, or "unclean", failover.

This is because the failover is done before the MDS down event has been
received. We should revert the FM behavior back to how it was before, unless
someone can come up with some other way to get the FM to postpone the
failover trigger until all the MDS events from the old active have been
plugged by the MDS receiver library.
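Just to illustrate the ordering I mean (a minimal sketch only; all names
below are hypothetical, not the actual FM code):

/* Hypothetical sketch only - not the actual OpenSAF FM code. It just
 * illustrates the desired ordering: do not trigger the failover until
 * MDS has reported the old active as down/flushed. */
#include <stdbool.h>
#include <stdio.h>

enum fm_event { OLD_ACTIVE_MDS_DOWN, SOME_OTHER_EVENT };  /* made-up events */

static bool old_active_down = false;

static void fm_mds_event_cb(enum fm_event ev)
{
        if (ev == OLD_ACTIVE_MDS_DOWN)
                old_active_down = true;        /* MDS down now received */
}

static void fm_consider_failover(void)
{
        if (!old_active_down) {
                /* postpone: messages from the old active may still be
                 * in flight through the MDS receiver library */
                return;
        }
        printf("trigger failover\n");          /* stand-in for the real trigger */
}

int main(void)
{
        fm_consider_failover();                /* too early: nothing happens */
        fm_mds_event_cb(OLD_ACTIVE_MDS_DOWN);  /* MDS down from old active */
        fm_consider_failover();                /* now the failover is clean */
        return 0;
}

The point is simply that the failover trigger must not run until the MDS
down indication from the old active has been seen.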

/AndersBj


________________________________
From: Mathi Naickan [mailto:[email protected]]
Sent: 22 January 2014 09:21
To: [opensaf:tickets]
Subject: [opensaf:tickets] #722 payloads did not go for reboot when both the 
controllers rebooted


Neither 721 nor 722 has been reproduced.
But yes, 722 happened immediately after 721 occurred in the same setup.
However, the problems are different in 721 and 722.
I.e. in 722, even though the controllers went for a reboot, the OpenSAF
services are still alive on the payload; they should have received the
service down events, after which a reboot of the payload would have been
triggered.

721, on the other hand, is most likely a case of a faster failover during
which the down events either were not delivered to IMM or were queued up in
the IMM process, but the assert (coordinator election) happened before the
down event was processed. I'm still on it, i.e. 721.
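
A minimal sketch of the suspected ordering (all names below are hypothetical,
not the actual IMMD code):

/* Hypothetical sketch only - not the actual IMMD code. It illustrates the
 * suspected race in 721: the down event from the old active is still
 * queued in the process mailbox when the coordinator election runs, so
 * the election asserts because the old coordinator still looks alive. */
#include <assert.h>
#include <stdbool.h>

struct mailbox {
        bool down_event_queued;        /* delivered by MDS, not yet handled */
        bool down_event_processed;     /* handled by the IMM process */
};

static void elect_coordinator(bool old_coord_known_down)
{
        /* Election assumes the old coordinator is already known to be gone. */
        assert(old_coord_known_down);  /* in 721, this is where it would fire */
}

int main(void)
{
        struct mailbox mbx = { .down_event_queued = true,
                               .down_event_processed = false };

        /* Faster failover: the election is reached before the queued down
         * event has been dequeued and processed. */
        elect_coordinator(mbx.down_event_processed);
        return 0;
}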

________________________________

[tickets:#722] payloads did not go for reboot when both the controllers rebooted

Status: unassigned
Created: Thu Jan 16, 2014 07:36 AM UTC by Sirisha Alla
Last Updated: Wed Jan 22, 2014 07:55 AM UTC
Owner: nobody

The issue is seen on changeset 4733 + patches of CLM corresponding to
changesets of #220. Continuous failovers are happening while some API
invocations of the IMM application are ongoing. The IMMD has asserted on the
new active, which is reported in ticket #721.

When both controllers got rebooted, the payloads did not get rebooted.
Instead, the OpenSAF services are up and running. CLM shows that both
payloads are not part of the cluster. When the payloads were restarted
manually, they joined the cluster.

PL-3 syslog:

Jan 15 18:23:09 SLES-64BIT-SLOT3 osafimmnd[3550]: NO implementer for class 
'testMA_verifyObjApplNoResponseModCallback_101' is released => class extent is 
UNSAFE
Jan 15 18:23:59 SLES-64BIT-SLOT3 logger: Invoking failover from 
invoke_failover.sh
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA DISCARD DUPLICATE FEVS 
message:92993
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA Error code 2 returned for 
message type 57 - ignoring
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA DISCARD DUPLICATE FEVS 
message:92994
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA Error code 2 returned for 
message type 57 - ignoring
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: WA Director Service in 
NOACTIVE state - fevs replies pending:1 fevs highest processed:92994
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[3550]: NO No IMMD service => cluster 
restart
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafamfnd[3572]: NO 
'safComp=IMMND,safSu=PL-3,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown' 
: Recovery is 'componentRestart'
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[6827]: Started
Jan 15 18:24:01 SLES-64BIT-SLOT3 osafimmnd[6827]: NO Persistent Back-End 
capability configured, Pbe file:imm.db (suffix may get added)
Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.176901] TIPC: Resetting link 
<1.1.3:eth0-1.1.2:eth0>, peer not responding
Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.176911] TIPC: Lost link 
<1.1.3:eth0-1.1.2:eth0> on network plane A
Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.176918] TIPC: Lost contact with 
<1.1.2>
Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.256091] TIPC: Resetting link 
<1.1.3:eth0-1.1.1:eth0>, peer not responding
Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.256100] TIPC: Lost link 
<1.1.3:eth0-1.1.1:eth0> on network plane A
Jan 15 18:24:07 SLES-64BIT-SLOT3 kernel: [ 6343.256106] TIPC: Lost contact with 
<1.1.1>
Jan 15 18:24:25 SLES-64BIT-SLOT3 kernel: [ 6361.425537] TIPC: Established link 
<1.1.3:eth0-1.1.2:eth0> on network plane A
Jan 15 18:24:27 SLES-64BIT-SLOT3 osafimmnd[6827]: NO SERVER STATE: 
IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Jan 15 18:24:27 SLES-64BIT-SLOT3 osafimmnd[6827]: NO SERVER STATE: 
IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Jan 15 18:24:27 SLES-64BIT-SLOT3 osafimmnd[6827]: NO SERVER STATE: 
IMM_SERVER_LOADING_PENDING --> IMM_SERVER_LOADING_CLIENT
Jan 15 18:24:29 SLES-64BIT-SLOT3 osafimmnd[6827]: NO ERR_BAD_HANDLE: Admin 
owner 1 does not exist
Jan 15 18:24:36 SLES-64BIT-SLOT3 kernel: [ 6372.473240] TIPC: Established link 
<1.1.3:eth0-1.1.1:eth0> on network plane A
Jan 15 18:24:39 SLES-64BIT-SLOT3 osafimmnd[6827]: NO ERR_BAD_HANDLE: Admin 
owner 2 does not exist
Jan 15 18:24:39 SLES-64BIT-SLOT3 osafimmnd[6827]: NO NODE STATE-> 
IMM_NODE_LOADING
Jan 15 18:24:45 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM 
is:5000
Jan 15 18:24:46 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM 
is:6000
Jan 15 18:24:47 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM 
is:7000
Jan 15 18:24:48 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM 
is:8000
Jan 15 18:24:49 SLES-64BIT-SLOT3 osafimmnd[6827]: WA Number of objects in IMM 
is:9000

After both the controllers came up, the following is the status:

SLES-64BIT-SLOT1:~ # immlist safNode=PL-3,safCluster=myClmCluster
Name                                               Type         Value(s)
========================================================================
safNode                                            SA_STRING_T  safNode=PL-3
saClmNodeLockCallbackTimeout                       SA_TIME_T    50000000000 (0xba43b7400, Thu Jan  1 05:30:50 1970)
saClmNodeIsMember                                  SA_UINT32_T  <Empty>
saClmNodeInitialViewNumber                         SA_UINT64_T  <Empty>
saClmNodeID                                        SA_UINT32_T  <Empty>
saClmNodeEE                                        SA_NAME_T    <Empty>
saClmNodeDisableReboot                             SA_UINT32_T  0 (0x0)
saClmNodeCurrAddressFamily                         SA_UINT32_T  <Empty>
saClmNodeCurrAddress                               SA_STRING_T  <Empty>
saClmNodeBootTimeStamp                             SA_TIME_T    <Empty>
saClmNodeAdminState                                SA_UINT32_T  1 (0x1)
saClmNodeAddressFamily                             SA_UINT32_T  <Empty>
saClmNodeAddress                                   SA_STRING_T  <Empty>
SaImmAttrImplementerName                           SA_STRING_T  safClmService
SaImmAttrClassName                                 SA_STRING_T  SaClmNode
SaImmAttrAdminOwnerName                            SA_STRING_T  IMMLOADER

SLES-64BIT-SLOT1:~ # immlist safAmfNode=PL-3,safAmfCluster=myAmfCluster
Name                                               Type         Value(s)
========================================================================
safAmfNode                                         SA_STRING_T  safAmfNode=PL-3
saAmfNodeSuFailoverMax                             SA_UINT32_T  2 (0x2)
saAmfNodeSuFailOverProb                            SA_TIME_T    1200000000000 (0x1176592e000, Thu Jan  1 05:50:00 1970)
saAmfNodeOperState                                 SA_UINT32_T  2 (0x2)
saAmfNodeFailfastOnTerminationFailure              SA_UINT32_T  0 (0x0)
saAmfNodeFailfastOnInstantiationFailure            SA_UINT32_T  0 (0x0)
saAmfNodeClmNode                                   SA_NAME_T    safNode=PL-3,safCluster=myClmCluster (36)
saAmfNodeCapacity                                  SA_STRING_T  <Empty>
saAmfNodeAutoRepair                                SA_UINT32_T  1 (0x1)
saAmfNodeAdminState                                SA_UINT32_T  1 (0x1)
SaImmAttrImplementerName                           SA_STRING_T  safAmfService
SaImmAttrClassName                                 SA_STRING_T  SaAmfNode
SaImmAttrAdminOwnerName                            SA_STRING_T  IMMLOADER

SLES-64BIT-SLOT3:/opt/goahead/tetware/opensaffire # /etc/init.d/opensafd status
safSISU=safSu=SC-1\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed2,safApp=OpenSAF
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-2\,safSg=NoRed\,safApp=OpenSAF,safSi=NoRed1,safApp=OpenSAF
        saAmfSISUHAState=ACTIVE(1)
safSISU=safSu=SC-1\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
        saAmfSISUHAState=STANDBY(2)
safSISU=safSu=SC-2\,safSg=2N\,safApp=OpenSAF,safSi=SC-2N,safApp=OpenSAF
        saAmfSISUHAState=ACTIVE(1)
SLES-64BIT-SLOT3:/opt/goahead/tetware/opensaffire # cat /etc/opensaf/node_name
PL-3
SLES-64BIT-SLOT3:/opt/goahead/tetware/opensaffire # ps -ef | grep saf
root      3538     1  0 17:16 ?        00:00:00 /bin/sh /usr/lib64/opensaf/clc-cli/osaf-transport-monitor
root      3563     1  0 17:16 ?        00:00:00 /usr/lib64/opensaf/osafclmna --tracemask=0xffffffff
root      3572     1  0 17:16 ?        00:00:00 /usr/lib64/opensaf/osafamfnd --tracemask=0xffffffff
root      3582     1  0 17:16 ?        00:00:00 /usr/lib64/opensaf/osafsmfnd
root      3591     1  0 17:16 ?        00:00:00 /usr/lib64/opensaf/osafmsgnd
root      3608     1  0 17:16 ?        00:00:00 /usr/lib64/opensaf/osaflcknd
root      3617     1  0 17:16 ?        00:00:00 /usr/lib64/opensaf/osafckptnd
root      3626     1  0 17:16 ?        00:00:00 /usr/lib64/opensaf/osafamfwd
root      6827     1  1 18:24 ?        00:00:13 /usr/lib64/opensaf/osafimmnd --tracemask=0xffffffff
root      7490  3073  0 18:42 pts/0    00:00:00 grep saf

The same is seen with PL-4. The AMF traces are attached.
