One way to deal with the loss of an IMM RT update due to an unplanned headless 
state (it was planned in this ticket, but the update failed to reach IMM, 
probably because of AMFD's job queue) is for AMFD to perform headless recovery 
as *housekeeping*, for example: 
- If AdminState is UNLOCKED (loss case), AMFD will assign ACTIVE/STANDBY to 
the existing SUs, which could currently have a QUIESCED assignment.
- If AdminState is LOCKED (no loss), AMFD will continue to remove the existing 
assignments.
So AMFD does not depend on AdminState having been updated before headless; it 
can continue the assignment based on the AdminState read after headless. From a 
testing point of view, this could make a test fail due to the lost update, but 
from the application's perspective, I think it is acceptable as long as 
AdminState and all assignments eventually make sense together.
This *loss* case can be seen as a rollback of the admin operation in the 
non-headless case.
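
To make the housekeeping idea concrete, here is a minimal sketch of the decision. The types and the function name are hypothetical stand-ins for illustration only; they are not the real AMFD definitions, and the real recovery would act on SG/SU/SI objects rather than return an enum.

```cpp
// Hypothetical, simplified stand-ins; not the real AMFD types.
enum class SiAdminState { Unlocked, Locked };

enum class HousekeepingAction {
  ReassignActiveStandby,  // loss case: bring QUIESCED assignments back
  ContinueRemoval         // no loss: finish removing the assignments
};

// Decide what to do from the AdminState read back from IMM after headless,
// without relying on whether the pre-headless update actually reached IMM.
HousekeepingAction choose_housekeeping(SiAdminState admin_after_headless) {
  if (admin_after_headless == SiAdminState::Unlocked) {
    // The admin operation was effectively rolled back.
    return HousekeepingAction::ReassignActiveStandby;
  }
  return HousekeepingAction::ContinueRemoval;
}
```

The point of the sketch is only that the decision takes a single input, the AdminState as read after headless.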

The above idea could only have been implemented before #1725, the reason being 
that #1725 now depends on other new RT attributes, not just AdminState. 
Therefore, a proposed solution for now could be:
- Use avd_saImmOiRtObjectUpdate_sync() for the RT attributes that are needed 
for headless recovery, instead of avd_saImmOiRtObjectUpdate(). If 
avd_saImmOiRtObjectUpdate_sync() does not succeed, the update is currently 
queued up by calling avd_saImmOiRtObjectUpdate(); this was done in #2188.
- Expand the rules that detect the loss: as in this ticket, if any SUSI has HA 
state QUIESCED, at least one related entity must be LOCKED. This rule will be 
added to avd_susi_validate_headless_cached_rta().
- Once a loss is detected, it needs to be isolated to avoid an AMFD crash. The 
idea is that AMFD will remove all existing assignments of the SUs belonging to 
the SG in which the loss was found. This could be done by using 
SG::su_fault(), but that needs to be verified.
- Update the documentation to describe the loss detection behavior.
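
The proposed detection rule could look roughly like the sketch below. It is an illustration under assumptions, not the actual avd_susi_validate_headless_cached_rta() code: `SusiView`, `susi_consistent()` and `sg_has_loss()` are hypothetical names, and which entities count as "related" is left open here.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified stand-ins; not the real AMFD types.
enum class AdminState { Unlocked, Locked };
enum class HaState { Active, Standby, Quiesced };

struct SusiView {
  HaState ha;
  // Admin states of the entities related to this SUSI, as read after headless.
  // Which entities belong here is an assumption left to the real code.
  std::vector<AdminState> related_admin_states;
};

// Rule: a QUIESCED SUSI is only consistent if at least one related entity
// is LOCKED. SUSIs in other HA states are not covered by this rule.
bool susi_consistent(const SusiView& susi) {
  if (susi.ha != HaState::Quiesced) return true;
  return std::any_of(susi.related_admin_states.begin(),
                     susi.related_admin_states.end(),
                     [](AdminState a) { return a == AdminState::Locked; });
}

// An SG is treated as having a lost update if any of its SUSIs breaks the
// rule; the caller would then isolate the SG by removing its assignments.
bool sg_has_loss(const std::vector<SusiView>& susis_of_sg) {
  return std::any_of(susis_of_sg.begin(), susis_of_sg.end(),
                     [](const SusiView& s) { return !susi_consistent(s); });
}
```

In this ticket's scenario the SUSI is QUIESCED while saAmfSIAdminState reads UNLOCKED, so `sg_has_loss()` would return true for that SG and its assignments would be the ones to remove.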

Any ideas? Please advise.

Thanks,
Minh


---

** [tickets:#2210] AMFD: Loss of RT attribute update before headless**

**Status:** assigned
**Milestone:** 5.2.FC
**Created:** Mon Nov 28, 2016 10:18 PM UTC by Minh Hon Chau
**Last Updated:** Wed Dec 14, 2016 12:42 AM UTC
**Owner:** Minh Hon Chau


A loss of the IMM RT saAmfSIAdminState update in AMFD has been seen just before 
the cluster goes headless. It results in a coredump after headless.

One scenario is:
- Issue amf-admin shutdown on the SI, and delay the CSI quiescing callback
- Stop the SCs, then release the CSI quiescing callback
- Restart the SCs
Observation: saAmfSIAdminState is read as UNLOCKED while the related SUSI was 
QUIESCED, and AMFD dumps core as below:

~~~
Thread 1 (Thread 0x7fec174a0780 (LWP 493)):
#0  0x00000000004fbfd5 in SG_2N::node_fail_si_oper (this=0x24109d0, su=0x2413440) at sg_2n_fsm.cc:3102
        s_susi = 0x8f50000000b
        susi_temp = 0x5fa169
        o_su = 0x2417f98
        __FUNCTION__ = "node_fail_si_oper"
        cb = 0x919240 <_control_block>
#1  0x00000000004fe69c in SG_2N::node_fail (this=0x24109d0, cb=0x919240 <_control_block>, su=0x2413440) at sg_2n_fsm.cc:3469
        a_susi = 0x1
        s_susi = 0x7fffedecd2d0
        o_su = 0x5a50bd <AVD_SU::any_susi_fsm_in(unsigned int)+497>
        flag = 2
        __FUNCTION__ = "node_fail"
        su_ha_state = 0
#2  0x0000000000513010 in AVD_SG::failover_absent_assignment (this=0x24109d0) at sg.cc:2273
        su = @0x2411330: 0x2413440
        __for_range = std::vector of length 2, capacity 2 = {0x2413440, 0x24111e0}
        __for_begin = 
        __for_end = 
        __FUNCTION__ = "failover_absent_assignment"
#3  0x000000000043be65 in avd_cluster_tmr_init_evh (cb=0x919240 <_control_block>, evt=0x7fec04000df0) at cluster.cc:103
        i_sg = 0x24109d0
        it = {first = "safSg=1,safApp=osaftest", second = }
        __FUNCTION__ = "avd_cluster_tmr_init_evh"
        su = 0x0
        node = 0x240f9b0
~~~


