Hi Minh,
Sorry for the late reply. Somehow updating this ticket does not send email.
Regarding the limitations: since #1725 was pushed in 5.1, these limitations 
also apply to 5.1. That is why I stated that this ticket should be an 
enhancement.

Loss of MBCSv checkpoints and of RTA (runtime attribute) updates can occur only 
while the controllers are up, irrespective of whether the SC-Absent feature is 
enabled, and we should treat the two as separate issues. Loss of an MBCSv 
update may be a problem during failover/switchover in a normal cluster. Loss of 
an RTA update is not a problem in a normal cluster, since runtime attributes 
are only informational for the user, but in an SC-Absent cluster it becomes a 
problem after recovery. This ticket is about the loss of RTA updates.
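
To make the distinction concrete, here is a minimal C++ model (not AMFD code) of how an RTA update can be lost. It assumes, for illustration only, that the director changes its in-memory state immediately but writes the runtime attribute to IMM later. `DirectorModel`, `pending_rt_updates`, and `flush_one` are hypothetical names. In a normal cluster the stale IMM value is only visible to users; in an SC-Absent cluster it is what the director reads back after recovery.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>
#include <utility>

// Hypothetical model, not AMFD code.
struct DirectorModel {
  // AMFD's in-memory view; lost when all SCs stop.
  std::map<std::string, std::string> memory;
  // Persisted RT attributes; survive the headless period.
  std::map<std::string, std::string> imm;
  // RT updates not yet written to IMM (assumed to be applied asynchronously).
  std::deque<std::pair<std::string, std::string>> pending_rt_updates;

  // Change the in-memory admin state at once and queue the RT attribute write.
  void set_admin_state(const std::string& si, const std::string& state) {
    memory[si] = state;
    pending_rt_updates.emplace_back(si, state);
  }

  // Apply the oldest queued RT update to IMM; returns false if none is queued.
  bool flush_one() {
    if (pending_rt_updates.empty()) return false;
    const auto& update = pending_rt_updates.front();
    imm[update.first] = update.second;
    pending_rt_updates.pop_front();
    return true;
  }

  // All SCs stop: the in-memory view and any unflushed RT updates are gone.
  void go_headless() {
    memory.clear();
    pending_rt_updates.clear();
  }

  // SCs restart: the director rebuilds its view from IMM only.
  void recover_from_imm() {
    assert(memory.empty());  // recovery starts from an empty view
    memory = imm;
  }
};
```

For example, setting `safSi=1` to UNLOCKED and flushing it, then setting it to SHUTTING_DOWN and going headless before the second flush, leaves the recovered view at UNLOCKED: the same kind of stale value reported in #2210.
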

**Normal cluster**: We have been solving RTA-related issues in failover 
situations in this area (#1141, #2009, etc.), and the current enhancement #2252 
is for that purpose. Regarding loss of MBCSv checkpointing, I believe this is 
the first time such an issue has been reported. On a controller switchover or 
failover we already continue the admin operation today; the only difference is 
that AMFD does not reply to the client, because the opId becomes invalid. We 
will have to keep doing this in the future even if some MBCSv update is missed, 
because the IMM spec has features such as admin operation continuation (not 
supported now). I think this can be raised as a discussion point, with the same 
traces, in a separate ticket: how should the application be handled when MBCSv 
state is lost between the active director and the standby director? The first 
question is how the standby director would even know that an MBCSv checkpoint 
was missed.
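
One generic way to answer that first question, sketched here under stated assumptions, is to stamp every checkpoint message with a monotonically increasing sequence number so the standby can detect a gap. This is a standard technique shown for illustration; it does not describe existing MBCSv or AMFD behavior, and `CkptMsg`, `StandbyReceiver`, and `needs_resync` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical checkpoint message: the active director numbers every update.
struct CkptMsg {
  std::uint64_t seq;  // 1, 2, 3, ... assigned by the active director
};

// Hypothetical standby-side tracker that detects missed checkpoints.
class StandbyReceiver {
 private:
  // Inclusive range of sequence numbers that never arrived.
  struct Gap {
    std::uint64_t first;
    std::uint64_t last;
  };

 public:
  // Returns how many updates were missed just before this message (0 if none).
  std::uint64_t on_message(const CkptMsg& msg) {
    assert(msg.seq > 0);  // sequence numbers start at 1
    // Duplicates and late (reordered) messages are ignored in this sketch.
    if (msg.seq <= last_seq_) return 0;
    const std::uint64_t missed = msg.seq - last_seq_ - 1;
    if (missed > 0) gaps_.push_back(Gap{last_seq_ + 1, msg.seq - 1});
    last_seq_ = msg.seq;
    return missed;
  }

  // True if any update was missed; the standby's state cannot be trusted
  // without some form of full resynchronization.
  bool needs_resync() const { return !gaps_.empty(); }

 private:
  std::uint64_t last_seq_ = 0;
  std::vector<Gap> gaps_;
};
```

Detecting the gap only answers the first question; what the standby should then do (resynchronize, or behave differently on takeover) remains the open discussion point.
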

**SC-Absent cluster**: Since AMF recovers SUSIs from IMM, loss of an RTA update 
here poses a major challenge. Loss of an MBCSv update may pose a challenge 
indirectly, because the controllers may fail in sequence rather than 
simultaneously. I think the documentation ticket itself is proof that we knew 
about this limitation of reading from IMM. So we need to decide whether to fix 
the small issue reported in this ticket or to develop a full solution around 
RTA loss. In either case, I think we cannot give up the approach already taken 
in #1725, i.e. continuing the admin operation or recovery when the SCs are 
unavailable for some duration, so we should not revert anything.

Thanks,
Praveen


---

**[tickets:#2210] AMFD: Loss of RT attribute update before headless**

**Status:** accepted
**Milestone:** 5.2.FC
**Created:** Mon Nov 28, 2016 10:18 PM UTC by Minh Hon Chau
**Last Updated:** Tue Jan 24, 2017 12:25 PM UTC
**Owner:** Minh Hon Chau


A loss of an IMM RT update of saAmfSIAdminState in AMFD has been seen just 
before the cluster goes headless. It results in a coredump when the cluster 
recovers from headless.

One scenario is:

- Issue `amf-admin shutdown` on an SI, delaying the CSI quiescing callback
- Stop the SCs, then release the CSI quiescing callback
- Restart the SCs

Observation: saAmfSIAdminState is read as UNLOCKED while the related SUSI was 
QUIESCED, and AMFD dumps core as below:

~~~
Thread 1 (Thread 0x7fec174a0780 (LWP 493)):
#0  0x00000000004fbfd5 in SG_2N::node_fail_si_oper (this=0x24109d0, su=0x2413440) at sg_2n_fsm.cc:3102
        s_susi = 0x8f50000000b
        susi_temp = 0x5fa169
        o_su = 0x2417f98
        __FUNCTION__ = "node_fail_si_oper"
        cb = 0x919240 <_control_block>
#1  0x00000000004fe69c in SG_2N::node_fail (this=0x24109d0, cb=0x919240 <_control_block>, su=0x2413440) at sg_2n_fsm.cc:3469
        a_susi = 0x1
        s_susi = 0x7fffedecd2d0
        o_su = 0x5a50bd <AVD_SU::any_susi_fsm_in(unsigned int)+497>
        flag = 2
        __FUNCTION__ = "node_fail"
        su_ha_state = 0
#2  0x0000000000513010 in AVD_SG::failover_absent_assignment (this=0x24109d0) at sg.cc:2273
        su = @0x2411330: 0x2413440
        __for_range = std::vector of length 2, capacity 2 = {0x2413440, 0x24111e0}
        __for_begin = 
        __for_end = 
        __FUNCTION__ = "failover_absent_assignment"
#3  0x000000000043be65 in avd_cluster_tmr_init_evh (cb=0x919240 <_control_block>, evt=0x7fec04000df0) at cluster.cc:103
        i_sg = 0x24109d0
        it = {first = "safSg=1,safApp=osaftest", second = }
        __FUNCTION__ = "avd_cluster_tmr_init_evh"
        su = 0x0
        node = 0x240f9b0
~~~



---

Sent from sourceforge.net because [email protected] is 
subscribed to https://sourceforge.net/p/opensaf/tickets/
