This happens because IMMND has restarted, AMFD has received BAD_HANDLE, and
AMFD is in the process of reinitializing, so cb->immOiHandle is zero. At this
point AMFD gets a role change event (because of an SI swap on 2N) from Standby
to Active, and the role change fails because cb->immOiHandle is zero and
immutil_saImmOiImplementerClear fails.
The best solution is to reject the role change when cb->immOiHandle is zero,
since we can't know how long IMMND will take to initialize the handle. This
will return the quiesced AMFD to Active and leave the other AMFD as Standby.
Any thoughts?
---
** [tickets:#405] ImplementerClear returns BAD HANDLE for the Applier OI when
IMMND restarts during switchover**
**Status:** accepted
**Milestone:** 4.4.2
**Created:** Fri May 31, 2013 05:39 AM UTC by Nagendra Kumar
**Last Updated:** Wed Dec 31, 2014 05:15 AM UTC
**Owner:** Nagendra Kumar
Migrated from http://devel.opensaf.org/ticket/2656
When IMMND is killed while a switchover is in progress, the AMFD applier OI
(i.e. the STANDBY that is becoming ACTIVE) gets BAD_HANDLE from
saImmOiImplementerClear(). This results in AMFD asserting and the switchover
failing.
Attached is the log and the assert information.
Snippet from /var/log/messages:
May 7 12:22:00 TMP-SLOT osafimmnd[7062]: SERVER STATE: IMM_SERVER_SYNC_CLIENT
--> IMM SERVER READY
May 7 12:22:00 TMP-SLOT osafamfd[5910]: Switching StandBy --> Active State
May 7 12:22:00 TMP-SLOT osafamfd[5910]: Switch Standby --> Active FAILED,
ImplementerClear failed 9
May 7 12:22:00 TMP-SLOT osafimmd[5852]: Received IMMD service event
May 7 12:22:00 TMP-SLOT osafimmd[5852]: Received IMMD service event
May 7 12:22:00 TMP-SLOT osafamfd[5910]: avd_msg_sanity_chk: invalid msg id 80,
from 20200 should be 85
May 7 12:22:00 TMP-SLOT osafimmnd[7062]: Implementer disconnected 22 <0, 20100>
(@safAmfService20100)
May 7 12:22:00 TMP-SLOT osafamfd[5910]: avd_msg_sanity_chk: invalid msg id 81,
from 20200 should be 85
May 7 12:22:00 TMP-SLOT osafamfd[5910]: avd_msg_sanity_chk: invalid msg id 82,
from 20200 should be 85
May 7 12:22:00 TMP-SLOT osafamfd[5910]: avd_msg_sanity_chk: invalid msg id 83,
from 20200 should be 85
May 7 12:22:00 TMP-SLOT osafamfd[5910]: avd_msg_sanity_chk: invalid msg id 84,
from 20200 should be 85
May 7 12:22:01 TMP-SLOT osafamfd[5910]: FAILOVER Active --> Quiesced FAILED,
ImplementerClear failed 9
May 7 12:22:01 TMP-SLOT osafamfd[5910]: avd_role.c:585: avd_mds_qsd_role_evh:
Assertion '0' failed.
May 7 12:22:01 TMP-SLOT osafamfnd[5920]: AMF director unexpectedly crashed
May 7 12:22:01 TMP-SLOT osafamfnd[5920]: Rebooting OpenSAF NodeId = 131584 EE
Name = , Reason: local AVD down(Adest) or both AVD down(Vdest) received
May 7 12:22:01 TMP-SLOT osafimmnd[7062]: Director Service in NOACTIVE state -
fevs replies pending:1 fevs highest processed:2407
Steps to reproduce
==================
1) Invoke switchover
2) kill IMMND on the controller that is becoming active
Backtrace of the amfd corefile:
#0 0x00007fe66dc14645 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fe66dc14645 in raise () from /lib64/libc.so.6
#1 0x00007fe66dc15c33 in abort () from /lib64/libc.so.6
#2 0x00007fe66f225e15 in osafassert_fail (file=0x4a406a "avd_role.c", line=585,
func=0x4a4660 "avd_mds_qsd_role_evh",
assertion=0x4a46b8 "0") at sysf_def.c:399
#3 0x000000000043c1be in avd_mds_qsd_role_evh (cb=0x6bcb80, evt=0x7fe6680015b0)
at avd_role.c:585
#4 0x000000000043af86 in avd_process_event (cb_now=0x6bcb80,
evt=0x7fe6680015b0) at avd_proc.c:589
#5 0x000000000043ad0d in avd_main_proc () at avd_proc.c:505
#6 0x0000000000409210 in main (argc=1, argv=0x7fff776adcf8) at amfd_main.c:47
(gdb) fr 2
#2 0x00007fe66f225e15 in osafassert_fail (file=0x4a406a "avd_role.c", line=585,
func=0x4a4660 "avd_mds_qsd_role_evh",
assertion=0x4a46b8 "0") at sysf_def.c:399
399 sysf_def.c: No such file or directory.
in sysf_def.c
(gdb) fr 3
#3 0x000000000043c1be in avd_mds_qsd_role_evh (cb=0x6bcb80, evt=0x7fe6680015b0)
at avd_role.c:585
585 avd_role.c: No such file or directory.
in avd_role.c
(gdb) p *cb
$1 = {avd_mbx = 4291821569, avd_hb_mbx = 0, mds_handle = 0x0, init_state =
AVD_APP_STATE, avd_fover_state = false,
avail_state_avd = SA_AMF_HA_ACTIVE, vaddr_pwe_hdl = 65537, vaddr_hdl = 1,
adest_hdl = 131071, vaddr = 1,
other_avd_adest = 564051710926873, local_avnd_adest = 565151639322652,
nd_msg_queue_list = {nd_msg_queue = 0x0, tail = 0x0},
evt_queue = {evt_msg_queue = 0x0, tail = 0x0}, mbcsv_hdl = 4293918753, ckpt_hdl
= 4292870177, mbcsv_sel_obj = 13,
stby_sync_state = AVD_STBY_IN_SYNC, synced_reo_type = 13, async_updt_cnt =
{cb_updt = 13, node_updt = 699, app_updt = 12, sg_updt = 68,
su_updt = 174, si_updt = 416, sg_su_oprlist_updt = 43, sg_admin_si_updt = 0,
siass_updt = 80, comp_updt = 756, csi_updt = 0,
compcstype_updt = 0, si_trans_updt = 0}, sync_required = true, async_updt_msgs
= {async_updt_queue = 0x0, tail = 0x0}, edu_hdl = {
is_inited = true, tree = {root_node = {bit = -1, left = 0x6d3c00, right =
0x6bcc48, key_info = 0x6d3be0 ""}, params = {key_size = 8,
info_size = 16843009, actual_key_size = 1869266944, node_size = 32742},
n_nodes = 32}, to_version =
4}, cluster_init_time = 0,
node_id_avd = 131584, node_id_avd_other = 131328, node_avd_failed = 0,
node_list = {root_node = {bit = -1, left = 0x6bcca0,
right = 0x6bcca0, key_info = 0x6d3bc0 ""}, params = {key_size = 4, info_size =
0, actual_key_size = 0, node_size = 0}, n_nodes = 0},
amf_init_tmr = {tmr_id = 0x0, type = AVD_TMR_SND_HB, node_id = 0, spons_si_name
= {length = 0, value = '\0' <repeats 255 times>},
dep_si_name = {length = 0, value = '\0' <repeats 255 times>}, is_active =
false}, heartbeat_tmr = {tmr_id = 0x6f0fb0,
type = AVD_TMR_SND_HB, node_id = 0, spons_si_name = {length = 0, value = '\0'
<repeats 255 times>}, dep_si_name = {length = 0,
value = '\0' <repeats 255 times>}, is_active = true}, heartbeat_tmr_period =
10000000000, nodes_exit_cnt = 4,
ntfHandle = 4279238657, ext_comp_info = {local_avnd_node = 0x0,
ext_comp_hlt_check = 0x0}, peer_msg_fmt_ver = 4, avd_peer_ver = 4,
immOiHandle = 0, immOmHandle = 34359869952, imm_sel_obj = 17, is_implementer =
false, clmHandle = 4285530113, clm_sel_obj = 15,
swap_switch = SA_FALSE}
(gdb) p *evt
$2 = {next = {next = 0x0}, rcv_evt = AVD_EVT_MDS_QSD_ACK, info = {avnd_msg =
0x0, avd_msg = 0x0, node_id = 0, tmr = {tmr_id = 0x0,
type = AVD_TMR_SND_HB, node_id = 0, spons_si_name = {length = 0, value = '\0'
<repeats 255 times>}, dep_si_name = {length = 0,
value = '\0' <repeats 255 times>}, is_active = false}}}
Changed 13 months ago by neelakanta
The standby AMFD registers as an applier. If IMMND restarts, the applier OI
gets exposed and BAD_HANDLE is returned to AMFD.
Presently AMFD tries to reinitialize with IMMND in a separate thread when
dispatch returns BAD_HANDLE. At the same time, in the main thread,
saImmOiImplementerClear() is attempted as part of switchover processing.
One possible solution is:
If immOiHandle is zero and ImplementerClear returns BAD_HANDLE, AMFD can
ignore this. As per the current flow of AMFD, when it subsequently attempts
ImplementerSet, if the handle is still zero and ImplementerSet returns
BAD_HANDLE, AMFD can retry from the main thread until the reinit_bg thread
completes.
Changed 13 months ago by anders
See also:
http://devel.opensaf.org/ticket/1933
Any solution to this failure case (coping with a local immnd crash during
switchover) should be just a special case of coping with a local immnd crash
during normal processing, plus, I assume, immnd crashes during other use cases.
Changed 13 months ago by nagendra
Can you please upload more syslog from both controllers, so that we can see
when the controller rebooted and rejoined, when the TIPC link was down, etc.
It will also help us with ticket 2657.
Changed 13 months ago by praveenmalviya
■owner changed from ravisekhar to praveenmalviya
■status changed from new to accepted
Changed 13 months ago by praveenmalviya
■patch_waiting changed from no to yes
Changed 13 months ago by hafe
■patch_waiting changed from yes to no
http://list.opensaf.org/pipermail/devel/2012-May/026219.html