OK, I’ll try the patch (can it be applied to 4.3.0, or do I need to take a 
snapshot from the dev branch?).

I also found the following cores from osafamfd last night.  Would these be 
related to this problem as well, or is this something else?

thanks
—
tony

--- Assertion in osafamfd ---

#3  0x0000000000440fba in avd_mds_qsd_role_evh (cb=0x6c4d40 <_avd_cb>, 
evt=0x7f87240080a0) at avd_role.c:575
575 osafassert(0);
(gdb) p cb
$4 = (AVD_CL_CB *) 0x6c4d40 <_avd_cb>
(gdb) p *cb
$5 = {avd_mbx = 4291821569, avd_hb_mbx = 0, mds_handle = 0x0, init_state = 
AVD_APP_STATE, avd_fover_state = false, avail_state_avd = SA_AMF_HA_ACTIVE, 
vaddr_pwe_hdl = 65537, vaddr_hdl = 1, adest_hdl = 131071,
  vaddr = 1, other_avd_adest = 0, local_avnd_adest = 298033400332316, 
nd_msg_queue_list = {nd_msg_queue = 0x0, tail = 0x0}, evt_queue = 
{evt_msg_queue = 0x0, tail = 0x0}, mbcsv_hdl = 4293918753,
  ckpt_hdl = 4292870177, mbcsv_sel_obj = 13, stby_sync_state = 
AVD_STBY_IN_SYNC, synced_reo_type = 0, async_updt_cnt = {cb_updt = 61, 
node_updt = 8101, app_updt = 78, sg_updt = 3980, su_updt = 5980, si_updt = 2420,
    sg_su_oprlist_updt = 1210, sg_admin_si_updt = 0, siass_updt = 1437, 
comp_updt = 6488, csi_updt = 0, compcstype_updt = 0, si_trans_updt = 0}, 
sync_required = true, async_updt_msgs = {async_updt_queue = 0x0,
    tail = 0x0}, edu_hdl = {is_inited = true, tree = {root_node = {bit = -1, 
left = 0x8dc760, right = 0x6c4e08 <_avd_cb+200>, key_info = 0x8dc650 ""}, 
params = {key_size = 8, info_size = 0, actual_key_size = 0,
        node_size = 0}, n_nodes = 22}, to_version = 6}, mds_edu_hdl = 
{is_inited = true, tree = {root_node = {bit = -1, left = 0x8dbbe0, right = 
0x6c4e50 <_avd_cb+272>, key_info = 0x8ca840 ""}, params = {
        key_size = 8, info_size = 393, actual_key_size = 0, node_size = 0}, 
n_nodes = 19}, to_version = 4}, cluster_init_time = 0, node_id_avd = 69391, 
node_id_avd_other = 69647, node_avd_failed = 0, node_list = {
    root_node = {bit = -1, left = 0x6c4ea8 <_avd_cb+360>, right = 0x6c4ea8 
<_avd_cb+360>, key_info = 0x8ca820 ""}, params = {key_size = 4, info_size = 0, 
actual_key_size = 0, node_size = 0}, n_nodes = 0},
  amf_init_tmr = {tmr_id = 0x7f8724004020, type = AVD_TMR_CL_INIT, node_id = 0, 
spons_si_name = {length = 0, value = '\000' <repeats 255 times>}, dep_si_name = 
{length = 0, value = '\000' <repeats 255 times>},
    is_active = false}, heartbeat_tmr = {tmr_id = 0x7f8724006f70, type = 
AVD_TMR_SND_HB, node_id = 0, spons_si_name = {length = 0, value = '\000' 
<repeats 255 times>}, dep_si_name = {length = 0,
      value = '\000' <repeats 255 times>}, is_active = true}, 
heartbeat_tmr_period = 10000000000, nodes_exit_cnt = 15, ntfHandle = 
4279238657, ext_comp_info = {local_avnd_node = 0x0, ext_comp_hlt_check = 0x0},
  peer_msg_fmt_ver = 4, avd_peer_ver = 6, immOiHandle = 77309480719, 
immOmHandle = 81604448015, imm_sel_obj = 17, is_implementer = true, clmHandle = 
4285530113, clm_sel_obj = 15, swap_switch = SA_FALSE,
  active_services_exist = true}
(gdb) p evt
$6 = (AVD_EVT *) 0x7f87240080a0
(gdb) p *evt
$7 = {next = {next = 0x0}, rcv_evt = AVD_EVT_MDS_QSD_ACK, info = {avnd_msg = 
0x0, avd_msg = 0x0, node_id = 0, tmr = {tmr_id = 0x0, type = AVD_TMR_SND_HB, 
node_id = 0, spons_si_name = {length = 0,
        value = '\000' <repeats 255 times>}, dep_si_name = {length = 0, value = 
'\000' <repeats 255 times>}, is_active = false}}}


Core was generated by `/usr/lib64/opensaf/osafamfd osafamfd'.
Program terminated with signal 6, Aborted.
#0  0x0000003e42234bb5 in __GI_raise (sig=<optimized out>) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:64
64   return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x0000003e42234bb5 in __GI_raise (sig=<optimized out>) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003e42237d13 in __GI_abort () at abort.c:91
#2  0x0000003e4361a602 in __osafassert_fail (__file=0x4ac056 "avd_role.c", 
__line=575, __func=0x4acca0 <__FUNCTION__.12339> "avd_mds_qsd_role_evh", 
__assertion=0x4ac5e0 "0") at sysf_def.c:301
#3  0x0000000000440fba in avd_mds_qsd_role_evh (cb=0x6c4d40 <_avd_cb>, 
evt=0x7f87240080a0) at avd_role.c:575
#4  0x000000000043fd56 in avd_process_event (cb_now=0x6c4d40 <_avd_cb>, 
evt=0x7f87240080a0) at avd_proc.c:591
#5  0x000000000043fab7 in avd_main_proc () at avd_proc.c:507
#6  0x0000000000409e79 in main (argc=2, argv=0x7fffec8e6648) at amfd_main.c:47



--- Assertion in clm ---

(gdb) p nodeAddress
$1 = (SaClmNodeAddressT *) 0x7f916c005b44
(gdb) p *nodeAddress
$2 = {family = (unknown: 0), length = 19365, value = 
"\000\000\000\000\000\000\017\001\001\000\001", '\000' <repeats 52 times>}


#0  0x0000003e42234bb5 in __GI_raise (sig=<optimized out>) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:64
64   return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0  0x0000003e42234bb5 in __GI_raise (sig=<optimized out>) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x0000003e42237d13 in __GI_abort () at abort.c:91
#2  0x0000003e4361a602 in __osafassert_fail (__file=0x425d55 "clms_mds.c", 
__line=307, __func=0x4263a0 <__FUNCTION__.9929> "encodeNodeAddressT", 
__assertion=0x425e5a "0") at sysf_def.c:301
#3  0x000000000041dd5e in encodeNodeAddressT (uba=0x7f9173ffe6c8, 
nodeAddress=0x7f916c005b44) at clms_mds.c:307
#4  0x000000000041de72 in clms_enc_node_get_msg (uba=0x7f9173ffe6c8, 
msg=0x7f916c005b40) at clms_mds.c:332
#5  0x000000000041e23d in clms_enc_cluster_ntf_buf_msg (uba=0x7f9173ffe6c8, 
notify_info=0x7f9173ffeb88) at clms_mds.c:418
#6  0x000000000041e57b in clms_enc_track_cbk_msg (uba=0x7f9173ffe6c8, 
msg=0x7f9173ffeb70) at clms_mds.c:533
#7  0x000000000041ecf7 in clms_mds_enc (info=0x7f9173ffe700) at clms_mds.c:724
#8  0x000000000041f411 in clms_mds_enc_flat (info=0x7f9173ffe700) at 
clms_mds.c:908
#9  0x000000000041fb0f in clms_mds_callback (info=0x7f9173ffe700) at 
clms_mds.c:1184
#10 0x0000003e4364e6b7 in mcm_msg_encode_full_or_flat_and_send (to=2 '\002', 
to_msg=0x7f9173ffe8c0, to_svc_id=35, svc_cb=0x6307d0, adest=299135479218219, 
dest_vdest_id=65535, snd_type=0, xch_id=0,
    pri=MDS_SEND_PRIORITY_MEDIUM) at mds_c_sndrcv.c:1417
#11 0x0000003e4364d96f in mds_mcm_send_msg_enc (to=2 '\002', svc_cb=0x6307d0, 
to_msg=0x7f9173ffe8c0, to_svc_id=35, dest_vdest_id=65535, req=0x7f9173ffe980, 
xch_id=0, dest=299135479218219, pri=MDS_SEND_PRIORITY_MEDIUM)
    at mds_c_sndrcv.c:1084
#12 0x0000003e4364d6b3 in mcm_pvt_normal_snd_process_common (env_hdl=65552, 
fr_svc_id=34, to_msg=..., to_dest=299135479218219, to_svc_id=35, 
req=0x7f9173ffe980, pri=MDS_SEND_PRIORITY_MEDIUM, xch_id=0)
    at mds_c_sndrcv.c:1033
#13 0x0000003e4364d1f8 in mcm_pvt_normal_svc_snd (env_hdl=65552, fr_svc_id=34, 
msg=0x7f9173ffeb70, to_dest=299135479218219, to_svc_id=35, req=0x7f9173ffe980, 
pri=MDS_SEND_PRIORITY_MEDIUM) at mds_c_sndrcv.c:890
#14 0x0000003e4364cc8b in mds_mcm_send (info=0x7f9173ffeab0) at 
mds_c_sndrcv.c:675
#15 0x0000003e4364c2a6 in mds_send (info=0x7f9173ffeab0) at mds_c_sndrcv.c:384
#16 0x0000003e4364bf12 in ncsmds_api (svc_to_mds_info=0x7f9173ffeab0) at 
mds_papi.c:104
#17 0x00000000004201d7 in clms_mds_msg_send (cb=0x62ac40 <_clms_cb>, 
msg=0x7f9173ffeb70, dest=0x65a668, mds_ctxt=0x0, prio=MDS_SEND_PRIORITY_MEDIUM, 
svc_id=NCSMDS_SVC_ID_CLMA) at clms_mds.c:1453
#18 0x000000000040f499 in clms_prep_and_send_track (cb=0x62ac40 <_clms_cb>, 
node=0x654ae0, client=0x65a640, step=SA_CLM_CHANGE_COMPLETED, 
notify=0x7f916c0009a0) at clms_imm.c:1064
#19 0x000000000040e8db in clms_send_track (cb=0x62ac40 <_clms_cb>, 
node=0x654ae0, step=SA_CLM_CHANGE_COMPLETED) at clms_imm.c:835
#20 0x0000000000409430 in clms_track_send_node_down (node=0x654ae0) at 
clms_evt.c:428
#21 0x000000000040ca38 in imm_impl_set_node_down_proc (_cb=0x62ac40 <_clms_cb>) 
at clms_imm.c:93
#22 0x0000003e42a07e18 in start_thread (arg=0x7f9173fff700) at 
pthread_create.c:309
#23 0x0000003e422e88bd in clone () at 
../sysdeps/unix/sysv/linux/x86_64/clone.S:115
On Feb 21, 2014, at 7:35 AM, Neelakanta Reddy 
<[email protected]<mailto:[email protected]>> wrote:

Hi,

Comments inline.

/Neel.
On Friday 21 February 2014 05:45 PM, Tony Hart wrote:
Hi Neel,
Thanks for the analysis.  It seems that multiple components tripped on the 
race condition in this case; I take it from your description that the fix was 
applied only to CLM?  Also, in this case the node didn’t recover despite 
multiple restarts. Does that fit with the scenario in ticket 528?
Apply the patch for CLM and test. If it is reproducible, please share
the syslogs of both controllers.
Is this reproducible? Not sure yet; this is the first time I’ve seen this 
particular crash, but we recently started testing on bigger systems, and that 
could be a factor.

We really need a fix for this - should I open a ticket?

thanks
—
tony

On Feb 21, 2014, at 5:36 AM, Neelakanta Reddy 
<[email protected]<mailto:[email protected]>> wrote:

Hi,

The same problem was observed in CLM and is fixed in
sourceforge.net/p/opensaf/tickets/528<http://sourceforge.net/p/opensaf/tickets/528>.
The fix is in changeset 4622 for opensaf-4.3.x.

For other services, the problem is not yet fixed.

Can you please confirm whether it is always reproducible?

/Neel.


On Friday 21 February 2014 05:30 AM, Tony Hart wrote:
4.3.0

BTW is there a way to tell at runtime what version is installed?


On Feb 20, 2014, at 4:03 AM, Neelakanta Reddy 
<[email protected]<mailto:[email protected]>> wrote:

Hi,

Which version of OpenSAF is being used? It looks to be an older release.

/Neel.

On Wednesday 19 February 2014 08:42 PM, Tony Hart wrote:
Hi Neel,
Thanks for the reply. I’ve attached a fuller log (just the osaf messages) from 
SCM2; unfortunately, the logs from SCM1 are not available.

—
tony



_______________________________________________
Opensaf-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/opensaf-users
