Hi Neel,

The purpose of the test is to see whether our system can continue to run “normally” when deployed in a geographically redundant configuration. That is, the two SCs are not co-located but reside thousands of km apart. This is simulated in the lab by adding a delay between the two servers that host the SCs.
What we’re seeing is that when the delay is increased to a certain value, the si-swap command between the two OpenSAF SUs results in an error.

[root@sb117vm0 ~]# date ; amf-adm si-swap safSi=SC-2N,safApp=OpenSAF;
Tue Mar 14 11:31:41 EDT 2017
error - saImmOmAdminOperationInvoke_2 FAILED: SA_AIS_ERR_TIMEOUT (5)

However, the logs show that the action actually completes about 2 seconds after the timeout.

Mar 14 11:31:48 sb117vm0 osafimmnd[21104]: WA Timeout on syncronous admin operation 1
Mar 14 11:31:50 sb117vm0 osafimmnd[21104]: NO Implementer disconnected 67 <0, 2020f> (@safAmfService2020f)
Mar 14 11:31:50 sb117vm0 osafimmnd[21104]: NO Implementer connected: 72 (safAmfService) <0, 2020f>
Mar 14 11:31:50 sb117vm0 osafamfd[21236]: NO Switching Quiesced --> StandBy
Mar 14 11:31:50 sb117vm0 osafrded[21057]: NO RDE role set to STANDBY
Mar 14 11:31:50 sb117vm0 osafamfd[21236]: NO Controller switch over done

I’m trying to determine whether there is some way to delay the immnd timeout so that the si-swap command returns success.

Regards,
David

From: Neelakanta Reddy [mailto:[email protected]]
Sent: Friday, March 17, 2017 7:10 AM
To: David Hoyt <[email protected]>; [email protected]
Subject: Re: [users] si-swap opensaf SUs results in error but the action still completes

Hi,

comments inline.

On 2017/03/16 07:33 PM, David Hoyt wrote:
> Some additional info.
>
> I found out that the users were testing in a lab that had a delay between the
> two SC nodes. The delay was added for geographical redundancy testing.
> Once the delay was reduced, the timeout error for the opensaf swap went away.
>
> In looking through the osafimmnd log file, I see the following:
> Mar 14 11:31:48.320965 osafimmnd [21104:ImmModel.cc:12042] T5 Forcing Adm Req continuation to expire 609885356033
> ...
> Mar 14 11:31:48.601903 osafimmnd [21104:ImmModel.cc:12437] T5 Timeout on AdministrativeOp continuation 609885356033 tmout:1
> Mar 14 11:31:48.601952 osafimmnd [21104:ImmModel.cc:11311] T5 REQ ADM CONTINUATION 5069295 FOUND FOR 609885356033
> Mar 14 11:31:48.601987 osafimmnd [21104:immnd_proc.c:1086] WA Timeout on syncronous admin operation 1
>
> The code around line 12042 of file ImmModel.cc is as follows:
>
> 12040 for(ci2=sAdmReqContinuationMap.begin(); ci2!=sAdmReqContinuationMap.end(); ++ci2) {
> 12041   if((ci2->second.mTimeout) && (ci2->second.mImplId == implHandle)) {
> 12042     TRACE_5("Forcing Adm Req continuation to expire %llu", ci2->first);
> 12043     ci2->second.mTimeout = 1; /* one second is minimum timeout. */
> 12044   }
> 12045 }
>
> Right after the log at line 12042 is generated, the timeout value is updated
> to 1 second (line 12043).

The node targeted by the admin operation went down from OpenSAF's perspective. The timeout is then reduced to the minimum of 1 second.

> Can I increase this to 2 seconds?

OpenSAF has already marked the other node as down. What additional benefit would increasing this to 2 seconds achieve?

> If so, would it cause any badness?

Please explain the end result you are targeting.

Regards,
Neel.

>
> Regards,
> David
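
For anyone following the thread, the effect of the loop quoted from ImmModel.cc can be sketched in isolation. This is only a minimal, self-contained illustration and not the OpenSAF code: the names `Continuation`, `ContinuationMap`, `forceExpire` and `exampleUsage` are hypothetical, and the field types, the 60-second starting timeout, and the use of 67 as the implementer handle are assumptions chosen for the example. Only the field names `mTimeout` and `mImplId`, the continuation id 609885356033, and the one-second value come from the quoted snippet and logs.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for one entry of sAdmReqContinuationMap.
// Field names follow the quoted snippet; the types are assumptions.
struct Continuation {
    uint32_t mTimeout;  // remaining timeout in seconds (zero is skipped by the loop)
    uint32_t mImplId;   // implementer that is expected to answer the admin operation
};

using ContinuationMap = std::map<uint64_t, Continuation>;

// Same shape as the quoted loop: every pending continuation that has a
// non-zero timeout and belongs to the given implementer gets its timeout
// cut to one second. Returns how many entries were changed.
int forceExpire(ContinuationMap& pending, uint32_t implHandle) {
    int forced = 0;
    for (auto& entry : pending) {
        Continuation& c = entry.second;
        if (c.mTimeout && c.mImplId == implHandle) {
            c.mTimeout = 1;  // one second is the minimum timeout
            ++forced;
        }
    }
    return forced;
}

// Illustrative values only: one continuation served by implementer 67,
// one served by an unrelated implementer.
void exampleUsage() {
    ContinuationMap pending;
    pending[609885356033ULL] = {60, 67};
    pending[42ULL] = {60, 99};

    int forced = forceExpire(pending, 67);

    assert(forced == 1);
    assert(pending[609885356033ULL].mTimeout == 1);  // cut to the minimum
    assert(pending[42ULL].mTimeout == 60);           // other implementer untouched
}
```

In this sketch the one-second value is applied only to continuations owned by the implementer passed in, which is consistent with Neel's point that the shortened timeout follows from that node being seen as down rather than from a general default.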
