Hi Minh,
What are the steps to reproduce after applying the patch 2477_rep.diff?
Thanks,
Praveen
---
**[tickets:#2477] amfd: Cyclic reboot after SC absence period (in large cluster)**
**Status:** review
**Milestone:** 5.17.06
**Labels:** assignment failover during stop of both SC 2416
**Created:** Fri Jun 02, 2017 06:17 AM UTC by Minh Hon Chau
**Last Updated:** Fri Jun 02, 2017 09:25 AM UTC
**Owner:** Minh Hon Chau
The problem in this ticket occurs in the same scenario as the one reported in
#2416.
After the SC absence period, amfd hits osafassert() and dumps core, and the
problem keeps repeating.
One of the patches for #2416 tried to call IMM sync as early as possible, which
works fine in a small cluster (5 nodes). In a large cluster of about 75 nodes,
however, the changed IMM sync calls have almost no effect.
In #2416, a problem attributed to unreliable IMM sync calls was seen: after the
SC absence period, amfd had 3 assignments for a 2N SG, namely 2 STANDBY SUSIs
and 1 ACTIVE SUSI. It was fixed by the commit "amfd: Add iteration to failover
all absent assignments [#2416]" (refer to:
https://sourceforge.net/p/opensaf/tickets/2416/#f83b)
Another variant of the problem with unreliable IMM calls before both SCs go
down is that amfd can end up with both SUs holding ACTIVE assignments, which
leads to the assert. So far this problem has only been seen in a large cluster.
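
To make the two variants concrete, here is a minimal sketch of the 2N
expectation that the recovered assignments violate: at most one SU holds
ACTIVE assignments and at most one SU holds STANDBY assignments. This is not
amfd code. `HaState`, `Assignment`, `TwoNCheck` and `check_2n_assignments` are
hypothetical names made up for illustration, and the real check in
sg_2n_fsm.cc may be structured quite differently.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for amfd's HA state and SUSI records.
enum class HaState { Active, Standby };

struct Assignment {
  std::string su;  // service unit name
  std::string si;  // service instance name
  HaState state;   // HA state of this SU-SI assignment
};

enum class TwoNCheck { Ok, MultipleActiveSus, MultipleStandbySus };

// Classify the assignments recovered for a single 2N SG.
// A 2N SG is expected to have at most one SU with ACTIVE assignments
// and at most one SU with STANDBY assignments.
TwoNCheck check_2n_assignments(const std::vector<Assignment>& susis) {
  std::set<std::string> active_sus;
  std::set<std::string> standby_sus;
  for (const Assignment& a : susis) {
    if (a.state == HaState::Active) {
      active_sus.insert(a.su);
    } else {
      standby_sus.insert(a.su);
    }
  }
  // Variant described in this ticket: both SUs hold ACTIVE assignments.
  if (active_sus.size() > 1) return TwoNCheck::MultipleActiveSus;
  // Variant described in #2416: two STANDBY SUSIs on different SUs.
  if (standby_sus.size() > 1) return TwoNCheck::MultipleStandbySus;
  return TwoNCheck::Ok;
}
```

With this sketch, the state reported here (ACTIVE assignments on both SUs of
the SG) is classified as `MultipleActiveSus`, while one reading of the #2416
state (2 STANDBY SUSIs on different SUs plus 1 ACTIVE SUSI) is classified as
`MultipleStandbySus`.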
Details of coredump:
~~~
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/lib64/opensaf/osafamfd'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f784279b0c7 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: zypper install
opensaf-amf-director-debuginfo-5.2.0-469.0.6128a2d.sle12.x86_64
(gdb) bt full
#0 0x00007f784279b0c7 in raise () from /lib64/libc.so.6
No symbol table info available.
#1 0x00007f784279c478 in abort () from /lib64/libc.so.6
No symbol table info available.
#2 0x00007f78435fdf4e in __osafassert_fail (__file=<optimized out>,
__line=<optimized out>, __func=<optimized out>,
__assertion=<optimized out>) at ../../opensaf/src/base/sysf_def.c:286
No locals.
#3 0x00007f78445671e8 in avd_sg_2n_act_susi (sg=<optimized out>,
stby_susi=stby_susi@entry=0x7ffeef034998, cb=0x7f78447f2e80 <_control_block>)
at ../../opensaf/src/amf/amfd/sg_2n_fsm.cc:596
susi = <optimized out>
a_susi_2 = 0x7f7845e0d0c0
s_susi_1 = 0x7f7845e0d0c0
su_2 = <optimized out>
t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
s_susi_2 = 0x7f7845e2a030
a_susi = 0x0
a_susi_1 = 0x7f7845e2a030
s_susi = 0x0
su_1 = 0x7f7845d69e60
#4 0x00007f784456d5d6 in SG_2N::node_fail (this=0x7f7845d5f4f0,
cb=0x7f78447f2e80 <_control_block>, su=0x7f7845d69e60)
at ../../opensaf/src/amf/amfd/sg_2n_fsm.cc:3402
a_susi = <optimized out>
s_susi = 0x7f7845d69a68
o_su = <optimized out>
flag = <optimized out>
__FUNCTION__ = "node_fail"
su_ha_state = <optimized out>
t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
#5 0x00007f784455de1a in AVD_SG::failover_absent_assignment
(this=0x7f7845d5f4f0) at ../../opensaf/src/amf/amfd/sg.cc:2307
t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
__FUNCTION__ = "failover_absent_assignment"
failed_su = 0x7f7845d69e60
#6 0x00007f7844514125 in avd_cluster_tmr_init_evh (cb=0x7f78447f2e80
<_control_block>, evt=<optimized out>)
at ../../opensaf/src/amf/amfd/cluster.cc:103
i_sg = 0x7f7845d5f4f0
__for_range = @0x7f7845ca2a90: {db = {_M_t = {
_M_impl =
{<std::allocator<std::_Rb_tree_node<std::pair<std::basic_string<char,
std::char_traits<char>, std::allocator<char> > const, AVD_SG*> > >> =
{<__gnu_cxx::new_allocator<std::_Rb_tree_node<std::pair<std::basic_string<char,
std::char_traits<char>, std::allocator<char> > const, AVD_SG*> > >> = {<No data
fields>}, <No data fields>},
_M_key_compare = {<std::binary_function<std::basic_string<char,
std::char_traits<char>, std::allocator<char> >, std::basic_string<char,
std::char_traits<char>, std::allocator<char> >, bool>> = {<No data fields>},
<No data fields>}, _M_header = {_M_color = std::_S_red,
_M_parent = 0x7f7845d515e0, _M_left = 0x7f7845d03ed0,
_M_right = 0x7f7845d81580}, _M_node_count = 28}}}}
t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
__FUNCTION__ = "avd_cluster_tmr_init_evh"
su = 0x0
node = <optimized out>
#7 0x00007f784453ca2c in process_event (cb_now=0x7f78447f2e80
<_control_block>, evt=0x7f78340013d0) at ../../opensaf/src/amf/amfd/main.cc:775
t_ = {trace_leave_called = false, file_ = 0x0, function_ = 0x0}
__FUNCTION__ = "process_event"
#8 0x00007f78444f6abe in main_loop () at ../../opensaf/src/amf/amfd/main.cc:691
pollretval = <optimized out>
evt = 0x7f78340013d0
polltmo = 0
term_fd = 24
cb = 0x7f78447f2e80 <_control_block>
error = <optimized out>
old_sync_state = AVD_STBY_OUT_OF_SYNC
#9 main (argc=<optimized out>, argv=<optimized out>) at
../../opensaf/src/amf/amfd/main.cc:848
No locals.
~~~
---
Sent from sourceforge.net because opensaf-tickets@lists.sourceforge.net is
subscribed to https://sourceforge.net/p/opensaf/tickets/
_______________________________________________
Opensaf-tickets mailing list
Opensaf-tickets@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/opensaf-tickets