See comments [LL]

Thanks.

-----Original Message-----
From: Gary Lee [mailto:gary....@dektech.com.au]
Sent: Tuesday, September 10, 2019 11:43 PM
To: William R Elliott <william.elli...@netcracker.com>; opensaf-users@lists.sourceforge.net
Cc: David S Thompson <david.thomp...@netcracker.com>; Lisa Ann Lentz-Liddell <lisa.a.lentz-lidd...@netcracker.com>
Subject: Re: [users] Standby controller switching to active testing
[External Email]
________________________________

Hi

Please see [GL]

On 11/9/19 6:41 am, William R Elliott wrote:
> Hello,
> We are using OpenSAF 5.1.0
> Test of the active controller failing over to the standby:
>
> Cluster has 2 controller nodes (rbm-fe-s1-h1, rbm-fe-s2-h1) and 2 payload nodes (rbm-fe-s1-h2, rbm-fe-s2-h2).
> When starting the test, rbm-fe-s2-h1 is the active controller.
>
>
> Active controller rbm-fe-s2-h1:
>
> Sep 10 20:21:00 rbm-fe-s2-h1 opensafd: Stopping OpenSAF Services
> Sep 10 20:21:00 rbm-fe-s2-h1 osafamfnd[28597]: NO Shutdown initiated
> Sep 10 20:21:00 rbm-fe-s2-h1 osafamfnd[28597]: NO Removing assignments from AMF components
> ..... SU terminating
> Sep 10 20:21:02 rbm-fe-s2-h1 osafamfnd[28597]: NO Removed assignments from AMF components
> Sep 10 20:21:02 rbm-fe-s2-h1 osafamfnd[28597]: NO Terminating all AMF components
> Sep 10 20:21:02 rbm-fe-s2-h1 osafimmd[28493]: exiting for shutdown
> Sep 10 20:21:02 rbm-fe-s2-h1 osafckptd[28629]: exiting for shutdown
> Sep 10 20:21:02 rbm-fe-s2-h1 osafimmnd[28511]: NO Implementer locally disconnected. Marking it as doomed 21 <844, 2450f> (safCheckPointService)
> Sep 10 20:21:02 rbm-fe-s2-h1 osafimmnd[28511]: WA DISCARD DUPLICATE FEVS message:18065
> Sep 10 20:21:02 rbm-fe-s2-h1 osafimmnd[28511]: WA Error code 2 returned for message type 82 - ignoring
> Sep 10 20:21:02 rbm-fe-s2-h1 osafimmnd[28511]: WA DISCARD DUPLICATE FEVS message:18066
> Sep 10 20:21:02 rbm-fe-s2-h1 osafimmnd[28511]: WA Error code 2 returned for message type 82 - ignoring
> Sep 10 20:21:02 rbm-fe-s2-h1 osafrded[28460]: exiting for shutdown
> Sep 10 20:21:02 rbm-fe-s2-h1 osafckptnd[28667]: exiting for shutdown
> Sep 10 20:21:02 rbm-fe-s2-h1 osafclmna[28444]: exiting for shutdown
> Sep 10 20:21:02 rbm-fe-s2-h1 osaffmd[28476]: exiting for shutdown
> Sep 10 20:21:02 rbm-fe-s2-h1 osafsmfnd[28832]: exiting for shutdown
> ... all the osaf processes exiting
> Sep 10 20:21:02 rbm-fe-s2-h1 osafimmnd[28511]: exiting for shutdown
> Sep 10 20:21:07 rbm-fe-s2-h1 osafamfnd[28597]: NO Terminated all AMF components
> Sep 10 20:21:07 rbm-fe-s2-h1 osafamfnd[28597]: NO Shutdown completed, exiting
> Sep 10 20:21:07 rbm-fe-s2-h1 IGMP: AL AMF Node Director is down, terminate this process
> Sep 10 20:21:07 rbm-fe-s2-h1 UDRU: AL AMF Node Director is down, terminate this process
> Sep 10 20:21:13 rbm-fe-s2-h1 opensafd: OpenSAF services successfully stopped
>
> 1. Why does it take 13 seconds to fully shut down the active controller?

[GL] This looks like a graceful shutdown, so applications receive callbacks to shut down gracefully. This may take some time, and logs are also flushed.

[LL] Is there any way to lessen this time?

>
> Standby controller becomes the active:
>
> Sep 10 20:21:02 rbm-fe-s1-h1 osafimmd[30010]: NO MDS event from svc_id 24 (change:1, dest:13)
> Sep 10 20:21:02 rbm-fe-s1-h1 osafimmd[30010]: NO MDS event from svc_id 24 (change:6, dest:13)
> Sep 10 20:21:02 rbm-fe-s1-h1 osafimmd[30010]: WA IMMD lost contact with peer IMMD (NCSMDS_RED_DOWN)
> Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: WA DISCARD DUPLICATE FEVS message:18065
> Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: WA Error code 2 returned for message type 82 - ignoring
> Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: WA DISCARD DUPLICATE FEVS message:18066
> Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: WA Error code 2 returned for message type 82 - ignoring
> Sep 10 20:21:02 rbm-fe-s1-h1 osafrded[29977]: NO Peer down on node 0x2450f
> Sep 10 20:21:02 rbm-fe-s1-h1 osafimmd[30010]: NO MDS event from svc_id 25 (change:4, dest:638880680275807)
> Sep 10 20:21:02 rbm-fe-s1-h1 osafimmd[30010]: WA IMMND DOWN on active controller 45 detected at standby immd!! 28. Possible failover
> Sep 10 20:21:02 rbm-fe-s1-h1 osafimmd[30010]: NO Skipping re-send of fevs message 18065 since it has recently been resent.
> Sep 10 20:21:02 rbm-fe-s1-h1 osafimmd[30010]: NO Skipping re-send of > fevs message 18066 since it has recently been resent. > Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: NO Global discard node > received for nodeId:2450f > pid:28511 > Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > disconnected 12 <0, 2450f(down)> > (MsgQueueService148751) > Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > disconnected 16 <0, 2450f(down)> > (safLogService) > Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > disconnected 19 <0, 2450f(down)> > (@safLogService_appl) > Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > disconnected 17 <0, 2450f(down)> > (safClmService) > Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > disconnected 18 <0, 2450f(down)> > (safAmfService) > Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > disconnected 24 <0, 2450f(down)> > (safEvtService) > ... > Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > disconnected 25 <0, 2450f(down)> > (safLckService) > Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > disconnected 20 <0, 2450f(down)> > (safMsgGrpService) > Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > disconnected 23 <0, 2450f(down)> > (safSmfService) > Sep 10 20:21:02 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > disconnected 21 <0, 2450f(down)> > (safCheckPointService) > Sep 10 20:21:02 rbm-fe-s1-h1 osafamfnd[30116]: NO > 'safSu=amfIgniteRaterSU1.1,safSg=amfIgniteRaterSG1,safApp=olcApp' > component restart probation timer started (timeout: 1000000 ns) Sep 10 > 20:21:02 rbm-fe-s1-h1 osafamfnd[30116]: NO Restarting a component of > 'safSu=amfIgniteRaterSU1.1,safSg=amfIgniteRaterSG1,safApp=olcApp' > (comp restart count: 1) Sep 10 20:21:02 rbm-fe-s1-h1 osafamfnd[30116]: > NO > 'safComp=amfRaterComp1.1.3,safSu=amfIgniteRaterSU1.1,safSg=amfIgniteRaterSG1,safApp=olcApp' > faulted due to 'errorReport' : Recovery is 
'componentRestart' > Sep 10 20:21:02 rbm-fe-s1-h1 osafamfnd[30116]: NO saAmfSUFailover is > true for 'safSu=amfCacheIgniteSU1.1,safSg=amfCacheIgniteSG1,safApp=olcApp' > Sep 10 20:21:02 rbm-fe-s1-h1 osafamfnd[30116]: NO SU failover > probation timer started (timeout: 0 ns) Sep 10 20:21:02 rbm-fe-s1-h1 > osafamfnd[30116]: NO Performing failover of > 'safSu=amfCacheIgniteSU1.1,safSg=amfCacheIgniteSG1,safApp=olcApp' (SU > failover count: 1) Sep 10 20:21:02 rbm-fe-s1-h1 osafamfnd[30116]: NO > 'safComp=amfCacheComp1.1.1,safSu=amfCacheIgniteSU1.1,safSg=amfCacheIgniteSG1,safApp=olcApp' > recovery action escalated from 'componentFailover' to 'suFailover' > Sep 10 20:21:02 rbm-fe-s1-h1 osafamfnd[30116]: NO > 'safComp=amfCacheComp1.1.1,safSu=amfCacheIgniteSU1.1,safSg=amfCacheIgniteSG1,safApp=olcApp' > faulted due to 'errorReport' : Recovery is 'suFailover' > Sep 10 20:21:02 rbm-fe-s1-h1 osafamfnd[30116]: NO Terminating > components of > 'safSu=amfCacheIgniteSU1.1,safSg=amfCacheIgniteSG1,safApp=olcApp'(abru > ptly & unordered) Sep 10 20:21:02 rbm-fe-s1-h1 osafamfnd[30116]: NO > 'safSu=amfCacheIgniteSU1.1,safSg=amfCacheIgniteSG1,safApp=olcApp' > Presence State INSTANTIATED => TERMINATING ... 
> Sep 10 20:21:13 rbm-fe-s1-h1 osafdtmd[29938]: NO Lost contact with > 'rbm-fe-s2-h1' > Sep 10 20:21:13 rbm-fe-s1-h1 osaffmd[29993]: NO Node Down event for node id > 2450f: > Sep 10 20:21:13 rbm-fe-s1-h1 osaffmd[29993]: NO Current role: STANDBY > Sep 10 20:21:13 rbm-fe-s1-h1 osaffmd[29993]: Rebooting OpenSAF NodeId > = 148751 EE Name = , > Reason: Received Node Down for peer controller, OwnNodeId = 141327, > SupervisionTime = 0 Sep 10 20:21:13 rbm-fe-s1-h1 osaffmd[29993]: node > reboot failure: exit code 32512 Sep 10 20:21:13 rbm-fe-s1-h1 > osaffmd[29993]: NO Controller Failover: Setting role to ACTIVE Sep 10 > 20:21:13 rbm-fe-s1-h1 osafrded[29977]: NO RDE role set to ACTIVE Sep 10 > 20:21:13 rbm-fe-s1-h1 osafrded[29977]: NO Running > '/usr/lib64/opensaf/opensaf_sc_active' > with 0 argument(s) > Sep 10 20:21:13 rbm-fe-s1-h1 osafimmd[30010]: NO ACTIVE request Sep 10 > 20:21:13 rbm-fe-s1-h1 osafclmd[30082]: NO ACTIVE request Sep 10 > 20:21:13 rbm-fe-s1-h1 osaflogd[30048]: NO ACTIVE request Sep 10 > 20:21:13 rbm-fe-s1-h1 osafimmd[30010]: NO ellect_coord invoke from > rda_callback ACTIVE Sep 10 20:21:13 rbm-fe-s1-h1 osafimmd[30010]: NO > New coord elected, resides at 2280f Sep 10 20:21:13 rbm-fe-s1-h1 > osafimmd[30010]: NO Old active NOT present => send discard node > payload 2520f Sep 10 20:21:13 rbm-fe-s1-h1 osafamfd[30099]: NO > FAILOVER StandBy --> Active Sep 10 20:21:13 rbm-fe-s1-h1 > osafclmd[30082]: safNode=rbm-fe-s2-h2,safCluster=myClmCluster LEFT, > init view=4, cluster view=7 Sep 10 20:21:13 rbm-fe-s1-h1 > osafamfnd[30116]: NO AVD NEW_ACTIVE, adest:1 Sep 10 20:21:13 > rbm-fe-s1-h1 osafimmd[30010]: NO MDS event from svc_id 24 (change:7, > dest:13) Sep 10 20:21:13 rbm-fe-s1-h1 osafimmd[30010]: NO MDS event > from svc_id 24 (change:2, dest:13) Sep 10 20:21:13 rbm-fe-s1-h1 > osafntfd[30065]: NO ACTIVE request Sep 10 20:21:13 rbm-fe-s1-h1 > osafimmnd[30028]: NO This IMMND is now the NEW Coord Sep 10 20:21:13 > rbm-fe-s1-h1 osafimmnd[30028]: NO Global discard node 
received for > nodeId:2520f > pid:3690 > Sep 10 20:21:13 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > disconnected 15 <0, 2520f(down)> > (MsgQueueService152079) > Sep 10 20:21:13 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > connected: 28 (safLogService) <824, > 2280f> > Sep 10 20:21:13 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > disconnected 27 <828, 2280f> > (@safAmfService2280f) > Sep 10 20:21:13 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > connected: 29 (safClmService) <827, > 2280f> > Sep 10 20:21:13 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > connected: 30 (safAmfService) <828, > 2280f> > Sep 10 20:21:13 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > (applier) connected: 31 > (@safLogService_appl) <10152, 2280f> > Sep 10 20:21:13 rbm-fe-s1-h1 osafamfd[30099]: NO Node 'rbm-fe-s2-h1' > left the cluster Sep 10 20:21:14 rbm-fe-s1-h1 osafamfd[30099]: NO FAILOVER > StandBy --> Active DONE! > Sep 10 20:21:14 rbm-fe-s1-h1 osafamfnd[30116]: NO Assigning > 'safSi=SC-2N,safApp=OpenSAF' ACTIVE to > 'safSu=rbm-fe-s1-h1,safSg=2N,safApp=OpenSAF' > Sep 10 20:21:14 rbm-fe-s1-h1 osafamfnd[30116]: NO Assigning > 'safSi=amfRMPSI1.1,safApp=olcApp' > ACTIVE to 'safSu=amfRMPSU1.1,safSg=amfRMPSG1,safApp=olcApp' > Sep 10 20:21:14 rbm-fe-s1-h1 osafamfnd[30116]: WA susi_assign_evh: > 'safSu=amfCacheIgniteSU1.1,safSg=amfCacheIgniteSG1,safApp=olcApp' has > no assignments Sep 10 20:21:14 rbm-fe-s1-h1 osafimmnd[30028]: NO > Implementer connected: 32 (safMsgGrpService) <841, 2280f> Sep 10 > 20:21:14 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer connected: 33 > (safCheckPointService) <843, 2280f> > Sep 10 20:21:14 rbm-fe-s1-h1 osafamfnd[30116]: NO Assigned > 'safSi=amfRMPSI1.1,safApp=olcApp' > ACTIVE to 'safSu=amfRMPSU1.1,safSg=amfRMPSG1,safApp=olcApp' > Sep 10 20:21:14 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > connected: 34 (safLckService) <842, > 2280f> > Sep 10 20:21:14 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer > connected: 35 (safEvtService) <820, > 2280f> > 
> Sep 10 20:21:14 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer connected: 36 (MsgQueueService152079) <10156, 2280f>
> Sep 10 20:21:14 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer locally disconnected. Marking it as doomed 36 <10156, 2280f> (MsgQueueService152079)
> Sep 10 20:21:14 rbm-fe-s1-h1 osafimmnd[30028]: NO Implementer disconnected 36 <10156, 2280f> (MsgQueueService152079)
> ....
> Sep 10 20:21:14 rbm-fe-s1-h1 osafamfd[30099]: NO Node 'rbm-fe-s2-h2' left the cluster
>
> 2. The standby quickly recognizes that the IMMND on the active controller is down, but it does not take the ACTIVE assignment until the moment the original ACTIVE reports that it is fully stopped.

[GL] If OpenSAF switched over immediately, before the old active is fully stopped, we could end up in a split-brain situation.

[LL] During this switch there is a service outage that lasts for seconds. Our processes handle thousands of messages per second, so many messages are lost. Why does the STANDBY need to wait for everything on the original ACTIVE to shut down fully before it becomes ACTIVE? OpenSAF knows that the ACTIVE is being stopped, so it should give up the ACTIVE state first, let the STANDBY take control as soon as possible, and only then shut the original node down.

> 3. While there is no ACTIVE, the components' IMM and NTF queries fail with the retry error. The components hit the maximum number of retries (the code does not retry forever) and then fail, restart, fail, restart, ....
> Why are the components dependent on the IMM on the controller? Shouldn't they be using the IMM on their own node?
>
> From the IMM documentation:
> 2.2.2 IMM Node Director
> The IMMND process executes on all nodes (both controller and payload).
> The IMMND process contains the IMM repository and is the actual provider of the IMMSv at the node. All connections and sessions started at the node are handled by the IMMND at that node.
>

[GL] IMM is read-only until an active IMM director is present.
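On question 3, one way a component might ride out the window with no active IMM director is to retry against a time budget rather than a fixed retry count. The following is only a sketch under stated assumptions: `TRY_AGAIN`, `OK`, `call_with_deadline` and the `op` callable are hypothetical stand-ins and not OpenSAF API. A real component would wrap its own IMM or NTF call and map the retry error it receives onto `TRY_AGAIN`.

```python
import time

# Hypothetical status values; a real component would map the status
# returned by its own IMM/NTF call onto these.
OK = "OK"
TRY_AGAIN = "TRY_AGAIN"


def call_with_deadline(op, budget_s=30.0, first_delay_s=0.1, max_delay_s=2.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Call op() until it returns something other than TRY_AGAIN,
    or until budget_s seconds have elapsed.

    Returns (status, attempts). The delay between attempts doubles
    each time, capped at max_delay_s, and never overshoots the deadline.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + budget_s
    delay = first_delay_s
    attempts = 0
    while True:
        attempts += 1
        status = op()
        if status != TRY_AGAIN:
            return status, attempts
        remaining = deadline - clock()
        if remaining <= 0:
            # Budget exhausted: report the retry error to the caller,
            # which can then decide whether to escalate.
            return TRY_AGAIN, attempts
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

In the logs above the standby saw the peer go down at 20:21:02 and took the ACTIVE role at 20:21:13, so the budget would need to sit comfortably above that gap of roughly 11 seconds. The 30 second default here is an arbitrary illustration, not a recommended value.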
> > Payload on s2-h2 - the immnd restarted > > [rbm-fe-s2-h2(Lopnsaf)telenet-lab2:/sft/Lopnsaf/HA_ROOT/logs/instantiation] > ps -ef | grep osaf > Lopnsaf 3666 1 0 18:23 ? 00:01:23 /usr/lib64/opensaf/osafdtmd > --tracemask=0xffffffff > root 3677 1 0 18:23 ? 00:00:02 /bin/sh > /usr/lib64/opensaf/clc-cli/osaf-transport-monitor > Lopnsaf 3709 1 0 18:23 ? 00:00:00 /usr/lib64/opensaf/osafclmna > root 3725 1 1 18:23 ? 00:02:17 /usr/lib64/opensaf/osafamfnd > --tracemask=0xffffffff > Lopnsaf 3745 1 0 18:23 ? 00:00:00 /usr/lib64/opensaf/osafamfwd > Lopnsaf 3767 1 0 18:23 ? 00:00:00 /usr/lib64/opensaf/osafckptnd > Lopnsaf 3788 1 0 18:23 ? 00:00:00 /usr/lib64/opensaf/osaflcknd > Lopnsaf 3832 1 0 18:23 ? 00:00:00 /usr/lib64/opensaf/osafmsgnd > root 3854 1 0 18:23 ? 00:00:00 /usr/lib64/opensaf/osafsmfnd > Lopnsaf 11886 1 1 20:21 ? 00:00:22 /usr/lib64/opensaf/osafimmnd > Lopnsaf 26579 15860 0 20:44 pts/3 00:00:00 grep --color=auto osaf > > > Sep 10 20:21:02 rbm-fe-s2-h2 osafimmnd[3690]: ER No IMMD service => > cluster restart, exiting Sep 10 20:21:02 rbm-fe-s2-h2 osafamfnd[3725]: > NO 'safSu=rbm-fe-s2- h2,safSg=NoRed,safApp=OpenSAF' component restart > probation timer started (timeout: 60000000000 > ns) > Sep 10 20:21:02 rbm-fe-s2-h2 osafamfnd[3725]: NO Restarting a > component of 'safSu=rbm-fe-s2- h2,safSg=NoRed,safApp=OpenSAF' (comp > restart count: 1) Sep 10 20:21:02 rbm-fe-s2-h2 osafamfnd[3725]: NO > 'safComp=IMMND,safSu=rbm-fe-s2- h2,safSg=NoRed,safApp=OpenSAF' faulted due to > 'avaDown' : Recovery is 'componentRestart' > Sep 10 20:21:02 rbm-fe-s2-h2 osafimmnd[11886]: Started > /////// This does not register with > the active controller and all of the processes on this node, start to > bounce Sep 10 20:21:02 rbm-fe-s2-h2 osafamfnd[3725]: NO > 'safSu=amfIgniteRaterSU1.4,safSg=amfIgniteRaterSG1,safApp=olcApp' > component restart probation timer started (timeout: 1000000 ns) Sep 10 > 20:21:02 rbm-fe-s2-h2 osafamfnd[3725]: NO Restarting a component of > 
> 'safSu=amfIgniteRaterSU1.4,safSg=amfIgniteRaterSG1,safApp=olcApp' (comp restart count: 1)
> Sep 10 20:21:02 rbm-fe-s2-h2 osafamfnd[3725]: NO 'safComp=amfRaterComp1.4.1,safSu=amfIgniteRaterSU1.4,safSg=amfIgniteRaterSG1,safApp=olcApp' faulted due to 'errorReport' : Recovery is 'componentRestart'
> Sep 10 20:21:02 rbm-fe-s2-h2 osafamfnd[3725]: NO 'safSu=amfIgniteRaterSU1.4,safSg=amfIgniteRaterSG1,safApp=olcApp' Component or SU restart probation timer expired
> Sep 10 20:21:02 rbm-fe-s2-h2 osafamfnd[3725]: NO saAmfSUFailover is true for 'safSu=amfIgniteUDRUSU2.6,safSg=amfIgniteUDRUSG2,safApp=olcApp'
> Sep 10 20:21:02 rbm-fe-s2-h2 osafamfnd[3725]: NO SU failover probation timer started (timeout: 0 ns)
> Sep 10 20:21:02 rbm-fe-s2-h2 osafamfnd[3725]: NO Performing failover of 'safSu=amfIgniteUDRUSU2.6,safSg=amfIgniteUDRUSG2,safApp=olcApp' (SU failover count: 1)
>
> 4. One of the payloads lost its IMMD service for some reason, and its IMMND exited and was restarted while there was no ACTIVE controller. The restarted IMMND never registers with the new active controller, and all of the processes on this payload bounce forever.
> The payload node was stopped and restarted, and the components stopped bouncing.
> The other payload node, shown below, did not have this problem.

[GL] I'm not too sure based on the available log snippets.
Gary > > Payload on s1-h2 is stable: > > Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: WA DISCARD DUPLICATE > FEVS message:18065 Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: WA > Error code 2 returned for message type 82 - ignoring Sep 10 20:21:02 > rbm-fe-s1-h2 osafimmnd[29109]: WA DISCARD DUPLICATE FEVS message:18066 > Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: WA Error code 2 > returned for message type 82 - ignoring Sep 10 20:21:02 rbm-fe-s1-h2 > osafimmnd[29109]: NO Global discard node received for nodeId:2450f > pid:28511 > Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 12 <0, 2450f(down)> > (MsgQueueService148751) > Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 21 <0, 2450f(down)> > (safCheckPointService) > Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 23 <0, 2450f(down)> > (safSmfService) > Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 20 <0, 2450f(down)> > (safMsgGrpService) > Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 25 <0, 2450f(down)> > (safLckService) > Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 24 <0, 2450f(down)> > (safEvtService) > Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 18 <0, 2450f(down)> > (safAmfService) > Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 17 <0, 2450f(down)> > (safClmService) > Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 19 <0, 2450f(down)> > (@safLogService_appl) > Sep 10 20:21:02 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 16 <0, 2450f(down)> > (safLogService) > Sep 10 20:21:02 rbm-fe-s1-h2 osafamfnd[29144]: NO saAmfSUFailover is > true for 'safSu=amfIgniteRARSU1.4,safSg=amfIgniteRARSG1,safApp=olcApp' > Sep 10 20:21:02 rbm-fe-s1-h2 osafamfnd[29144]: NO SU failover > probation timer started (timeout: 0 ns) Sep 10 
20:21:02 rbm-fe-s1-h2 > osafamfnd[29144]: NO Performing failover of > 'safSu=amfIgniteRARSU1.4,safSg=amfIgniteRARSG1,safApp=olcApp' (SU > failover count: 1) Sep 10 20:21:02 rbm-fe-s1-h2 osafamfnd[29144]: NO > 'safComp=amfRARComp1.4.1,safSu=amfIgniteRARSU1.4,safSg=amfIgniteRARSG1,safApp=olcApp' > recovery action escalated from 'componentFailover' to 'suFailover' > Sep 10 20:21:02 rbm-fe-s1-h2 osafamfnd[29144]: NO > 'safComp=amfRARComp1.4.1,safSu=amfIgniteRARSU1.4,safSg=amfIgniteRARSG1 > ,safApp=olcApp' faulted due to 'errorReport' : Recovery is 'suFailover' > Sep 10 20:21:02 rbm-fe-s1-h2 osafamfnd[29144]: NO Terminating > components of > 'safSu=amfIgniteRARSU1.4,safSg=amfIgniteRARSG1,safApp=olcApp'(abruptly & > unordered) .... > Sep 10 20:21:13 rbm-fe-s1-h2 osafdtmd[29085]: NO Lost contact with > 'rbm-fe-s2-h1' > .... > Sep 10 20:21:13 rbm-fe-s1-h2 osafamfnd[29144]: NO AVD NEW_ACTIVE, > adest:1 Sep 10 20:21:13 rbm-fe-s1-h2 osafamfnd[29144]: NO > 'safSu=amfIgniteIRPSU1.6,safSg=amfIgniteIRPSG1,safApp=olcApp' > component restart probation timer started (timeout: 1000000 ns) Sep 10 > 20:21:13 rbm-fe-s1-h2 osafamfnd[29144]: NO Restarting a component of > 'safSu=amfIgniteIRPSU1.6,safSg=amfIgniteIRPSG1,safApp=olcApp' (comp > restart count: 1) Sep 10 20:21:13 rbm-fe-s1-h2 osafamfnd[29144]: NO > 'safComp=amfIRPComp1.6.1,safSu=amfIgniteIRPSU1.6,safSg=amfIgniteIRPSG1 > ,safApp=olcApp' faulted due to 'avaDown' : Recovery is 'componentRestart' > Sep 10 20:21:13 rbm-fe-s1-h2 osafimmnd[29109]: NO Global discard node > received for nodeId:2520f > pid:3690 > Sep 10 20:21:13 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 15 <0, 2520f(down)> > (MsgQueueService152079) > Sep 10 20:21:13 rbm-fe-s1-h2 osafamfnd[29144]: NO > 'safSu=amfIgniteIRPSU1.2,safSg=amfIgniteIRPSG1,safApp=olcApp' > component restart probation timer started (timeout: 1000000 ns) Sep 10 > 20:21:13 rbm-fe-s1-h2 osafamfnd[29144]: NO Restarting a component of > 
'safSu=amfIgniteIRPSU1.2,safSg=amfIgniteIRPSG1,safApp=olcApp' (comp > restart count: 1) Sep 10 20:21:13 rbm-fe-s1-h2 osafamfnd[29144]: NO > 'safComp=amfIRPComp1.2.1,safSu=amfIgniteIRPSU1.2,safSg=amfIgniteIRPSG1 > ,safApp=olcApp' faulted due to 'avaDown' : Recovery is 'componentRestart' > Sep 10 20:21:13 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > connected: 28 (safLogService) <0, > 2280f> > Sep 10 20:21:13 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 27 <0, 2280f> > (@safAmfService2280f) > Sep 10 20:21:13 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > connected: 29 (safClmService) <0, > 2280f> > Sep 10 20:21:13 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > connected: 30 (safAmfService) <0, > 2280f> > Sep 10 20:21:13 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > (applier) connected: 31 > (@safLogService_appl) <0, 2280f> > Sep 10 20:21:13 rbm-fe-s1-h2 osafamfnd[29144]: NO > 'safSu=amfIgniteIRPSU2.4,safSg=amfIgniteIRPSG2,safApp=olcApp' > component restart probation timer started (timeout: 1000000 ns) ... 
> Sep 10 20:21:14 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > connected: 32 (safMsgGrpService) <0, 2280f> Sep 10 20:21:14 > rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer connected: 33 > (safCheckPointService) <0, 2280f> > Sep 10 20:21:14 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > connected: 34 (safLckService) <0, > 2280f> > Sep 10 20:21:14 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > connected: 35 (safEvtService) <0, > 2280f> > Sep 10 20:21:14 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > connected: 36 > (MsgQueueService152079) <0, 2280f> > Sep 10 20:21:14 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 36 <0, 2280f> > (MsgQueueService152079) > Sep 10 20:21:14 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > connected: 37 (safSmfService) <0, > 2280f> > Sep 10 20:21:17 rbm-fe-s1-h2 osafamfnd[29144]: WA susi_assign_evh: > 'safSu=amfCacheIgniteSU2.2,safSg=amfCacheIgniteSG2,safApp=olcApp' has > no assignments Sep 10 20:21:18 rbm-fe-s1-h2 osafamfnd[29144]: NO > Repair request for > 'safSu=amfIgniteRARSU1.4,safSg=amfIgniteRARSG1,safApp=olcApp' > Sep 10 20:21:18 rbm-fe-s1-h2 osafamfnd[29144]: NO > 'safSu=amfIgniteRARSU1.4,safSg=amfIgniteRARSG1,safApp=olcApp' Presence > State UNINSTANTIATED => UNINSTANTIATED Sep 10 20:21:18 rbm-fe-s1-h2 > osafamfnd[29144]: NO Repair request for > 'safSu=amfIgniteUDRUSU1.8,safSg=amfIgniteUDRUSG1,safApp=olcApp' > Sep 10 20:21:18 rbm-fe-s1-h2 osafamfnd[29144]: NO > 'safSu=amfIgniteUDRUSU1.8,safSg=amfIgniteUDRUSG1,safApp=olcApp' > Presence State UNINSTANTIATED => UNINSTANTIATED ... > Sep 10 20:21:22 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > connected: 38 > (MsgQueueService148751) <0, 2280f> > Sep 10 20:21:22 rbm-fe-s1-h2 osafimmnd[29109]: NO Implementer > disconnected 38 <0, 2280f> > (MsgQueueService148751) > ... > > > Thanks. 
>
> ________________________________
> The information transmitted herein is intended only for the person or entity to which it is addressed and may contain confidential, proprietary and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.
>
> _______________________________________________
> Opensaf-users mailing list
> Opensaf-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/opensaf-users