Hi Minh,

What are the steps to reproduce after applying the patch 2477_rep.diff?


Thanks,
Praveen


---

**[tickets:#2477] amfd: Cyclic reboot after SC absence period (in large cluster)**

**Status:** review
**Milestone:** 5.17.06
**Labels:** assignment failover during stop of both SC 2416 
**Created:** Fri Jun 02, 2017 06:17 AM UTC by Minh Hon Chau
**Last Updated:** Fri Jun 02, 2017 09:25 AM UTC
**Owner:** Minh Hon Chau


The problem in this ticket occurs in the same scenario as the one reported in 
#2416.

After the SC absence period, amfd hits osafassert(), which causes a coredump, 
and the problem happens repeatedly.

One of the patches for #2416 tried to call IMM sync as soon as possible, and it 
works fine with a small cluster (5 nodes). But in a large cluster of about 75 
nodes, the change to the IMM sync calls has almost no effect.

In #2416, a problem was seen, under the assumption of unreliable IMM sync 
calls, in which after the SC absence period amfd had 3 assignments for a 2N SG: 
2 STANDBY SUSIs and 1 ACTIVE SUSI. It was fixed by the commit "amfd: Add 
iteration to failover all absent assignments [#2416]" (refer to: 
https://sourceforge.net/p/opensaf/tickets/2416/#f83b)

Another variant of the problem of unreliable IMM calls before both SCs go down 
is that amfd can end up with both SUs having ACTIVE assignments, which leads to 
the assert. So far this problem has only been seen in a large cluster.
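
To make the two inconsistent states concrete (#2416: two STANDBY plus one 
ACTIVE; this ticket: two ACTIVE), below is a minimal sketch of a consistency 
check for the assignments of one SI in a 2N SG. It assumes a 2N SG allows at 
most one ACTIVE and one STANDBY assignment per SI. All types and names here 
(`HaState`, `Assignment`, `TwoNCheck`, `check_2n_si`) are hypothetical 
stand-ins for illustration; they are not the amfd data structures and this is 
not the fix in the patch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; NOT the real amfd types.
enum class HaState { Active, Standby };

struct Assignment {
  std::string su;  // service unit holding the assignment
  std::string si;  // service instance being assigned
  HaState state;   // HA state of this SU for this SI
};

enum class TwoNCheck { Ok, MultipleActive, MultipleStandby };

// Classify the assignments recorded for one SI of a 2N SG.
// Assumption: at most one ACTIVE and one STANDBY assignment per SI is valid.
TwoNCheck check_2n_si(const std::vector<Assignment>& assignments,
                      const std::string& si) {
  int active = 0;
  int standby = 0;
  for (const auto& a : assignments) {
    if (a.si != si) continue;  // only look at the requested SI
    if (a.state == HaState::Active) {
      ++active;
    } else {
      ++standby;
    }
  }
  // Two ACTIVE assignments is the variant described in this ticket.
  if (active > 1) return TwoNCheck::MultipleActive;
  // Two STANDBY assignments is the variant described in #2416.
  if (standby > 1) return TwoNCheck::MultipleStandby;
  return TwoNCheck::Ok;
}
```

A check of this kind run over the state read back after the SC absence period 
would flag the bad state before the failover code reaches the assert.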


Details of coredump:
 
~~~
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib64/opensaf/osafamfd'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f784279b0c7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: zypper install 
opensaf-amf-director-debuginfo-5.2.0-469.0.6128a2d.sle12.x86_64
(gdb) bt full
#0  0x00007f784279b0c7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x00007f784279c478 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x00007f78435fdf4e in __osafassert_fail (__file=<optimized out>, 
__line=<optimized out>, __func=<optimized out>, 
    __assertion=<optimized out>) at ../../opensaf/src/base/sysf_def.c:286
No locals.
#3  0x00007f78445671e8 in avd_sg_2n_act_susi (sg=<optimized out>, 
stby_susi=stby_susi@entry=0x7ffeef034998, cb=0x7f78447f2e80 <_control_block>)
    at ../../opensaf/src/amf/amfd/sg_2n_fsm.cc:596
        susi = <optimized out>
        a_susi_2 = 0x7f7845e0d0c0
        s_susi_1 = 0x7f7845e0d0c0
        su_2 = <optimized out>
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        s_susi_2 = 0x7f7845e2a030
        a_susi = 0x0
        a_susi_1 = 0x7f7845e2a030
        s_susi = 0x0
        su_1 = 0x7f7845d69e60
#4  0x00007f784456d5d6 in SG_2N::node_fail (this=0x7f7845d5f4f0, 
cb=0x7f78447f2e80 <_control_block>, su=0x7f7845d69e60)
    at ../../opensaf/src/amf/amfd/sg_2n_fsm.cc:3402
        a_susi = <optimized out>
        s_susi = 0x7f7845d69a68
        o_su = <optimized out>
        flag = <optimized out>
        __FUNCTION__ = "node_fail"
        su_ha_state = <optimized out>
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
#5  0x00007f784455de1a in AVD_SG::failover_absent_assignment 
(this=0x7f7845d5f4f0) at ../../opensaf/src/amf/amfd/sg.cc:2307
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        __FUNCTION__ = "failover_absent_assignment"
        failed_su = 0x7f7845d69e60
#6  0x00007f7844514125 in avd_cluster_tmr_init_evh (cb=0x7f78447f2e80 
<_control_block>, evt=<optimized out>)
    at ../../opensaf/src/amf/amfd/cluster.cc:103
        i_sg = 0x7f7845d5f4f0
        __for_range = @0x7f7845ca2a90: {db = {_M_t = {
              _M_impl = 
{<std::allocator<std::_Rb_tree_node<std::pair<std::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const, AVD_SG*> > >> = 
{<__gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<std::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const, AVD_SG*> > >> = {<No data 
fields>}, <No data fields>}, 
                _M_key_compare = {<std::binary_function<std::basic_string<char, 
std::char_traits<char>, std::allocator<char> >, std::basic_string<char, 
std::char_traits<char>, std::allocator<char> >, bool>> = {<No data fields>}, 
<No data fields>}, _M_header = {_M_color = std::_S_red, 
                  _M_parent = 0x7f7845d515e0, _M_left = 0x7f7845d03ed0, 
_M_right = 0x7f7845d81580}, _M_node_count = 28}}}}
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        __FUNCTION__ = "avd_cluster_tmr_init_evh"
        su = 0x0
        node = <optimized out>
#7  0x00007f784453ca2c in process_event (cb_now=0x7f78447f2e80 
<_control_block>, evt=0x7f78340013d0) at ../../opensaf/src/amf/amfd/main.cc:775
        t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
        __FUNCTION__ = "process_event"
#8  0x00007f78444f6abe in main_loop () at ../../opensaf/src/amf/amfd/main.cc:691
        pollretval = <optimized out>
        evt = 0x7f78340013d0
        polltmo = 0
        term_fd = 24
        cb = 0x7f78447f2e80 <_control_block>
        error = <optimized out>
        old_sync_state = AVD_STBY_OUT_OF_SYNC
#9  main (argc=<optimized out>, argv=<optimized out>) at 
../../opensaf/src/amf/amfd/main.cc:848
No locals.
~~~


