Hi Neel, comments below.
Neelakanta Reddy wrote:
> Hi AndersBj,
> The procedure (Scenario B below) followed in #724 is not run in the older
> releases.
Everything points to this not being a problem introduced in 4.4, so I changed
the milestone to 4.2.x.
>
> In the #724 ticket, while a huge CCB of 200k objects was being applied, a
> parallel sync was initiated from a payload node.
> This resulted in a PBE restart.
>
> Following are the observations from the shared logs for #724:
>
> When the IMMND traces are observed:
>
> 1.
>
> Jan 16 16:23:02.742591 osafimmnd [4464:ImmModel.cc:4673] >> ccbTerminate
> Jan 16 16:23:02.742599 osafimmnd [4464:ImmModel.cc:4682] T5 terminate the CCB
> 2
> Jan 16 16:23:02.742613 osafimmnd [4464:ImmModel.cc:4751] T2 Aborting Create
> of PERF_RDN_CONFIG1 admo:8
> Jan 16 16:23:02.742635 osafimmnd [4464:ImmModel.cc:4762] WA Admin owner id 8
> does not exist
> Jan 16 16:23:02.742647 osafimmnd [4464:ImmModel.cc:4751] T2 Aborting Create
> of PERF_RDN_CONFIG10 admo:8
> Jan 16 16:23:02.742667 osafimmnd [4464:ImmModel.cc:4762] WA Admin owner id
> 8 does not exist
> Jan 16 16:23:02.742679 osafimmnd [4464:ImmModel.cc:4751] T2 Aborting Create
> of PERF_RDN_CONFIG100 admo:8
> Jan 16 16:23:02.742699 osafimmnd [4464:ImmModel.cc:4762] WA Admin owner id 8
> does not exist
> Jan 16 16:23:02.742711 osafimmnd [4464:ImmModel.cc:4751] T2 Aborting Create
> of PERF_RDN_CONFIG1000 admo:8
> Jan 16 16:23:02.742729 osafimmnd [4464:ImmModel.cc:4762] WA Admin owner id 8
> does not exist
>
> Aborting (removing) each object from IMMND RAM takes about 0.00003 seconds.
>
> 2. If this per-object cost is extrapolated to 200k objects, then aborting the
> whole CCB should take close to 6 seconds.
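Just to spell out that arithmetic from the numbers above:

    200000 objects x 0.00003 s/object = 6 s

so the 6-second estimate for aborting the whole CCB is consistent.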
>
> 3. For some objects, searching the adminowner vector (using std::find_if)
> takes close to 0.200 seconds,
> and this is repeated for 10 of the removed objects.
Ok, I don't know how you measured that, but it is of course ridiculously high
and explains why it takes so long.
This is also where I first suspected the performance bug.
But has the system created that many admin-owners?
The admin-owner id used in this test was a low figure like '8', which implies
that not many admin-owners had ever been created.
So that search must be incredibly inefficient.
>
>Jan 16 16:23:02.782230 osafimmnd [4464:ImmModel.cc:4751] T2 Aborting Create of
>PERF_RDN_CONFIG100010 admo:8
>Jan 16 16:23:03.000985 osafimmnd [4464:ImmModel.cc:4762] WA Admin owner id 8
>does not exist
>
>Jan 16 16:23:03.001164 osafimmnd [4464:ImmModel.cc:4751] T2 Aborting Create of
>PERF_RDN_CONFIG10002 admo:8
>Jan 16 16:23:03.268866 osafimmnd [4464:ImmModel.cc:4762] WA Admin owner id 8
>does not exist
>
>Jan 16 16:23:03.269026 osafimmnd [4464:ImmModel.cc:4751] T2 Aborting Create of
>PERF_RDN_CONFIG100029 admo:8
>Jan 16 16:23:03.536758 osafimmnd [4464:ImmModel.cc:4762] WA Admin owner id 8
>does not exist
>
>4. As Anders pointed out, if the check of the admin owner ID when removing each
>object in a CCB can be avoided, then the problem may not have occurred.
It can normally be avoided, since it should be the same admin-owner id for all
objects in the same ccb.
There is a catch though.
In theory the admin-owner could have been set to the same name but a different
id by some other user/admo-handle. This is extremely unlikely but possible.
The IMM spec gives an ACK on an attempt to set admin-owner on an object that
already has an admin-owner set, if the admin-owner name is the same.
Unlike implementer-names, where only one oi-handle can ever be attached to the
same implementer-name at any given "time"==(evs event) in the cluster, an
admin-owner name can be attached to many admo-handles at the same time.
But the fix is to only look up the admin-owner if the id differs from the one
in the previous iteration.
That situation should be extremely rare, and even if it happens, the id
switches themselves would be rare.
In theory you could have 200K objects and 200K different admin-owner handles,
but I think we can ignore such obvious provocations.
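Something like the following minimal sketch of the caching idea; all type and
function names here are illustrative stand-ins, not the actual ImmModel.cc
symbols:

#include <vector>
#include <cstddef>

struct AdminOwnerInfo {            // stand-in for the real admin-owner record
    unsigned int mId;
};

struct CcbObject {                 // stand-in for an object created in the CCB
    unsigned int mAdminOwnerId;
};

typedef std::vector<AdminOwnerInfo*> AdminOwnerVector;

// The existing (expensive) lookup, whatever it does internally.
AdminOwnerInfo* lookupAdminOwner(AdminOwnerVector& owners, unsigned int id)
{
    for (std::size_t i = 0; i < owners.size(); ++i) {
        if (owners[i]->mId == id) return owners[i];
    }
    return 0;                      // not found => the "does not exist" warning
}

// Abort loop: only repeat the lookup when the admin-owner id differs from the
// previous object's id (assuming 0 is never a valid admin-owner id).
void abortCcbObjects(const std::vector<CcbObject>& objects,
                     AdminOwnerVector& owners)
{
    unsigned int lastId = 0;
    AdminOwnerInfo* cached = 0;

    for (std::size_t i = 0; i < objects.size(); ++i) {
        unsigned int id = objects[i].mAdminOwnerId;
        if (id != lastId) {        // the rare case: id switched inside the CCB
            cached = lookupAdminOwner(owners, id);
            lastId = id;
        }
        // ... the real per-object abort would use 'cached' here ...
        (void)cached;
    }
}

For a normal CCB this means one lookup in total instead of one per object,
while still handling the odd case where the admin-owner id switches between
objects.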
>5. When I checked with Surender, no other applications were running in the
>cluster.
>
>6. One more thing needs to be clarified: is it the C++ library that is taking
>time to find the adminowner id?
Probably it's the *way* that the search is done. It is not just scanning an
array for a match against an integer.
It is invoking a function for each element, and that function does the match.
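Roughly this kind of pattern (illustrative only, not the actual ImmModel.cc
code):

#include <algorithm>
#include <vector>

struct AdminOwnerInfo { unsigned int mId; };

struct IdIs {                                 // predicate invoked per element
    unsigned int mId;
    explicit IdIs(unsigned int id) : mId(id) {}
    bool operator()(const AdminOwnerInfo* owner) const {
        return owner->mId == mId;             // the actual match happens here
    }
};

AdminOwnerInfo* findAdminOwner(std::vector<AdminOwnerInfo*>& owners,
                               unsigned int id)
{
    std::vector<AdminOwnerInfo*>::iterator it =
        std::find_if(owners.begin(), owners.end(), IdIs(id));
    return (it != owners.end()) ? *it : 0;
}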
>I am testing to see if the performance improves by removing the checking of
>the adminowner in the same CCB.
You don't really need to verify that; it's obvious.
But the fix can't just remove the admin-owner check.
The lookup just needs to be smarter (i.e. skip it if the id is the same as the
last one).
/AndersBj
---
**[tickets:#724] imm: sync with payload node resulted in controller reboots**
**Status:** accepted
**Created:** Thu Jan 16, 2014 11:22 AM UTC by surender khetavath
**Last Updated:** Thu Jan 30, 2014 09:57 AM UTC
**Owner:** Anders Bjornerstedt
changeset: 4733
setup: 2 controllers
Test:
Brought up 2 controllers and added 2 lakh (200000) objects. Then started pl-3.
OpenSAF start on pl-3 was not successful. After some time both controllers
rebooted, but pl-3 did not go for a reboot even though the controllers were not
available.
syslog on sc-1:
Jan 16 16:26:43 SLES-SLOT4 osafamfnd[4536]: NO
'safComp=IMMND,safSu=SC-1,safSg=NoRed,safApp=OpenSAF' faulted due to
'healthCheckcallbackTimeout' : Recovery is 'componentRestart'
Jan 16 16:26:43 SLES-SLOT4 osafimmnd[4464]: WA Admin owner id 8 does not exist
Jan 16 16:26:43 SLES-SLOT4 osafimmnd[4464]: WA Admin owner id 8 does not exist
Jan 16 16:26:43 SLES-SLOT4 osafimmnd[4464]: WA Admin owner id 8 does not exist
Jan 16 16:26:43 SLES-SLOT4 osafimmnd[4464]: WA Admin owner id 8 does not exist
Jan 16 16:26:43 SLES-SLOT4 osafimmnd[4464]: WA Admin owner id 8 does not exist
Jan 16 16:26:43 SLES-SLOT4 osafimmnd[4464]: WA Admin owner id 8 does not exist
Jan 16 16:26:43 SLES-SLOT4 osafimmnd[4464]: WA Admin owner id 8 does not exist
Jan 16 16:26:43 SLES-SLOT4 osafimmnd[4464]: WA Admin owner id 8 does not exist
Jan 16 16:26:43 SLES-SLOT4 osafimmpbed: NO PBE received SIG_TERM, closing db
handle
Jan 16 16:26:43 SLES-SLOT4 osafimmd[4454]: WA IMMND coordinator at 2010f
apparently crashed => electing new coord
Jan 16 16:26:43 SLES-SLOT4 osafntfimcnd[4496]: ER saImmOiDispatch() Fail
SA_AIS_ERR_BAD_HANDLE (9)
Jan 16 16:26:43 SLES-SLOT4 osafimmnd[4980]: Started
Jan 16 16:26:43 SLES-SLOT4 osafimmpbed: WA PBE lost contact with parent IMMND -
Exiting
Jan 16 16:26:43 SLES-SLOT4 osafimmd[4454]: NO New coord elected, resides at
2020f
Jan 16 16:26:43 SLES-SLOT4 osafimmnd[4980]: NO Persistent Back-End capability
configured, Pbe file:imm.db (suffix may get added)
Jan 16 16:26:43 SLES-SLOT4 osafimmnd[4980]: NO SERVER STATE:
IMM_SERVER_ANONYMOUS --> IMM_SERVER_CLUSTER_WAITING
Jan 16 16:26:44 SLES-SLOT4 osafimmd[4454]: NO New IMMND process is on ACTIVE
Controller at 2010f
Jan 16 16:26:44 SLES-SLOT4 osafimmd[4454]: NO Extended intro from node 2010f
Jan 16 16:26:44 SLES-SLOT4 osafimmnd[4980]: NO Fevs count adjusted to 201407
preLoadPid: 0
Jan 16 16:26:44 SLES-SLOT4 osafimmnd[4980]: NO SERVER STATE:
IMM_SERVER_CLUSTER_WAITING --> IMM_SERVER_LOADING_PENDING
Jan 16 16:26:44 SLES-SLOT4 osafimmnd[4980]: NO SERVER STATE:
IMM_SERVER_LOADING_PENDING --> IMM_SERVER_SYNC_PENDING
Jan 16 16:26:44 SLES-SLOT4 osafimmd[4454]: WA IMMND on controller (not
currently coord) requests sync
Jan 16 16:26:44 SLES-SLOT4 osafimmd[4454]: NO Node 2010f request sync
sync-pid:4980 epoch:0
Jan 16 16:26:44 SLES-SLOT4 osafimmnd[4980]: NO NODE STATE-> IMM_NODE_ISOLATED
Jan 16 16:26:53 SLES-SLOT4 osafamfd[4526]: NO Re-initializing with IMM
Jan 16 16:27:08 SLES-SLOT4 osafimmd[4454]: WA IMMND coordinator at 2020f
apparently crashed => electing new coord
Jan 16 16:27:08 SLES-SLOT4 osafimmd[4454]: ER Failed to find candidate for new
IMMND coordinator
Jan 16 16:27:08 SLES-SLOT4 osafimmd[4454]: ER Active IMMD has to restart the
IMMSv. All IMMNDs will restart
Jan 16 16:27:09 SLES-SLOT4 osafimmd[4454]: ER IMM RELOAD => ensure cluster
restart by IMMD exit at both SCs, exiting
Jan 16 16:27:09 SLES-SLOT4 osafimmnd[4980]: ER IMMND forced to restart on order
from IMMD, exiting
Jan 16 16:27:09 SLES-SLOT4 osafamfnd[4536]: NO
'safComp=IMMND,safSu=SC-1,safSg=NoRed,safApp=OpenSAF' faulted due to 'avaDown'
: Recovery is 'componentRestart'
Jan 16 16:27:09 SLES-SLOT4 osafamfnd[4536]: NO
'safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF' faulted due to 'avaDown' :
Recovery is 'nodeFailfast'
Jan 16 16:27:09 SLES-SLOT4 osafamfnd[4536]: ER
safComp=IMMD,safSu=SC-1,safSg=2N,safApp=OpenSAF Faulted due to:avaDown Recovery
is:nodeFailfast
Jan 16 16:27:09 SLES-SLOT4 osafamfnd[4536]: Rebooting OpenSAF NodeId = 131343
EE Name = , Reason: Component faulted: recovery is node failfast, OwnNodeId =
131343, SupervisionTime = 60
Jan 16 16:27:09 SLES-SLOT4 opensaf_reboot: Rebooting local node; timeout=60
Jan 16 16:27:11 SLES-SLOT4 kernel: [ 1435.892099] md: stopping all md devices.
Read from remote host 172.1.1.4: Connection reset by peer
console output on pl-3
ps -ef| grep saf
root 16523 1 0 16:22 ? 00:00:00 /bin/sh
/usr/lib64/opensaf/clc-cli/osaf-transport-monitor
root 16629 1 0 16:27 ? 00:00:00 /usr/lib64/opensaf/osafimmnd
--tracemask=0xffffffff
root 16860 10365 0 16:39 pts/0 00:00:00 grep saf
---