#0 0x00007fe7eba49bb9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fe7eba4cfc8 in __GI_abort () at abort.c:89
#2 0x00007fe7eba42a76 in __assert_fail_base (fmt=0x7fe7ebb94370 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x7fe7ec463b27 "0", file=file@entry=0x7fe7ec4691df "mds_dt_trans.c", line=line@entry=94, function=function@entry=0x7fe7ec4692a0 <__PRETTY_FUNCTION__.10222> "mds_mdtm_queue_add_unsent_msg") at assert.c:92
#3 0x00007fe7eba42b22 in __GI___assert_fail (assertion=assertion@entry=0x7fe7ec463b27 "0", file=file@entry=0x7fe7ec4691df "mds_dt_trans.c", line=line@entry=94, function=function@entry=0x7fe7ec4692a0 <__PRETTY_FUNCTION__.10222> "mds_mdtm_queue_add_unsent_msg") at assert.c:101
#4 0x00007fe7ec449e3d in mds_mdtm_queue_add_unsent_msg (tcp_buffer=tcp_buffer@entry=0x7fff623f3df0 "", bufflen=bufflen@entry=108) at mds_dt_trans.c:94
#5 0x00007fe7ec44a5b8 in mds_mdtm_unsent_queue_add_send (tcp_buffer=tcp_buffer@entry=0x7fff623f3df0 "", bufflen=bufflen@entry=108) at mds_dt_trans.c:153
#6 0x00007fe7ec44b05f in mds_mdtm_send_tcp (req=0x7fff623f3fe0) at mds_dt_trans.c:593
#7 0x00007fe7ec4541e8 in mcm_msg_encode_full_or_flat_and_send (pri=<optimized out>, xch_id=<optimized out>, snd_type=<optimized out>, dest_vdest_id=<optimized out>, adest=<optimized out>, svc_cb=<optimized out>, to_svc_id=<optimized out>, to_msg=<optimized out>, to=<optimized out>) at mds_c_sndrcv.c:1516
#8 mds_mcm_send_msg_enc (to=<optimized out>, svc_cb=svc_cb@entry=0x18806a0, to_msg=to_msg@entry=0x7fff623f41d0, to_svc_id=to_svc_id@entry=25, dest_vdest_id=<optimized out>, req=req@entry=0x7fff623f4270, xch_id=xch_id@entry=0, dest=568511936069707, pri=pri@entry=MDS_SEND_PRIORITY_MEDIUM) at mds_c_sndrcv.c:1086
#9 0x00007fe7ec4576db in mcm_pvt_process_svc_bcast_common (env_hdl=<optimized out>, fr_svc_id=fr_svc_id@entry=24, to_msg=..., to_svc_id=to_svc_id@entry=25, req=req@entry=0x7fff623f4270, scope=NCSMDS_SCOPE_NONE, pri=pri@entry=MDS_SEND_PRIORITY_MEDIUM, flag=flag@entry=0 '\000') at mds_c_sndrcv.c:3882
#10 0x00007fe7ec458195 in mcm_pvt_normal_svc_bcast (pri=MDS_SEND_PRIORITY_MEDIUM, scope=<optimized out>, req=0x7fff623f4270, to_svc_id=25, msg=<optimized out>, fr_svc_id=24, env_hdl=<optimized out>) at mds_c_sndrcv.c:3734
#11 mds_mcm_send (info=0x7fff623f4320) at mds_c_sndrcv.c:790
#12 mds_send (info=info@entry=0x7fff623f4320) at mds_c_sndrcv.c:386
#13 0x00007fe7ec4521c8 in ncsmds_api (svc_to_mds_info=svc_to_mds_info@entry=0x7fff623f4320) at mds_papi.c:104
#14 0x000000000040d482 in immd_mds_bcast_send (cb=cb@entry=0x629360 <_immd_cb>, evt=evt@entry=0x7fff623f4420, to_svc=to_svc@entry=NCSMDS_SVC_ID_IMMND) at immd_mds.c:765
#15 0x00000000004054bd in immd_evt_proc_fevs_req (cb=cb@entry=0x629360 <_immd_cb>, evt=evt@entry=0x7fff623f4630, sinfo=sinfo@entry=0x7fe7e4001ad0, deallocate=deallocate@entry=false) at immd_evt.c:314
#16 0x0000000000406e56 in immd_evt_proc_sync_fevs_base (cb=cb@entry=0x629360 <_immd_cb>, sinfo=sinfo@entry=0x7fe7e4001ad0, evt=0x7fe7e4001990, evt=0x7fe7e4001990) at immd_evt.c:1930
#17 0x0000000000407f57 in immd_process_evt () at immd_evt.c:164
#18 0x0000000000402781 in main (argc=<optimized out>, argv=<optimized out>) at immd_main.c:291
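
The abort comes from frame #4: mds_mdtm_queue_add_unsent_msg() hits an assert at mds_dt_trans.c:94 while mds_mdtm_send_tcp() is queuing a message it could not send over TCP. Below is a minimal sketch, not the actual OpenSAF code, of the kind of assert(0) error branch that produces frames #0 through #3; MAX_UNSENT_MSGS, MAX_MSG_LEN, unsent_queue and unsent_count are hypothetical names used only for illustration.

/* Hedged sketch only -- not the OpenSAF implementation. It illustrates how
 * an assert(0) in an "add to unsent queue" helper turns an unhandled error
 * (here: a full queue or an oversized message) into the
 * __assert_fail -> abort -> raise frames at the top of the backtrace. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_UNSENT_MSGS 100                 /* hypothetical queue limit */
#define MAX_MSG_LEN     1500                /* hypothetical message size */

static uint8_t  unsent_queue[MAX_UNSENT_MSGS][MAX_MSG_LEN]; /* hypothetical */
static uint32_t unsent_count;

static void queue_add_unsent_msg(const uint8_t *tcp_buffer, uint16_t bufflen)
{
	/* No error path is implemented for these cases, so the process
	 * aborts instead of reporting the send failure upwards. */
	if (unsent_count >= MAX_UNSENT_MSGS || bufflen > MAX_MSG_LEN)
		assert(0);

	memcpy(unsent_queue[unsent_count], tcp_buffer, bufflen);
	unsent_count++;
}

int main(void)
{
	const uint8_t msg[108] = {0};   /* bufflen=108, as in frame #4 */
	queue_add_unsent_msg(msg, sizeof(msg));
	return 0;
}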
---
** [tickets:#1072] Sync stop after few payload nodes joining the cluster (TCP)**
**Status:** invalid
**Milestone:** 4.3.3
**Created:** Fri Sep 12, 2014 09:20 PM UTC by Adrian Szwej
**Last Updated:** Mon Sep 15, 2014 10:46 PM UTC
**Owner:** Anders Bjornerstedt
Communication is MDS over TCP, on a 2+3 cluster (two system controllers and three payloads). The scenario is: start the SCs; start the first payload and wait for sync; start the second payload and wait for sync; then start the third payload. The third payload fails to join, or sometimes the fourth.
It is not possible to get more than two or three payloads synchronized, because the bug is triggered in a consistent way.
The following is logged in the loading IMMND and causes the joining node to time out and fail to start up.
Sep 6 6:58:02.096550 osafimmnd [502:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_SEARCHNEXT (17) from 2020f
Sep 6 6:58:02.096575 osafimmnd [502:immnd_evt.c:1443] >> immnd_evt_proc_search_next
Sep 6 6:58:02.096613 osafimmnd [502:immnd_evt.c:1454] T2 SEARCH NEXT, Look for id:1664
Sep 6 6:58:02.096641 osafimmnd [502:ImmModel.cc:1366] T2 ERR_TRY_AGAIN: Too many pending incoming fevs messages (> 16) rejecting sync iteration next request
Sep 6 6:58:02.096725 osafimmnd [502:immnd_evt.c:1676] << immnd_evt_proc_search_next
Sep 6 6:58:03.133230 osafimmnd [502:immnd_proc.c:1980] IN Sync Phase-3: step:540
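
The ERR_TRY_AGAIN entry is the key line: the loading IMMND rejects the sync iterator's searchNext request while more than IMMSV_DEFAULT_FEVS_MAX_PENDING (16) fevs messages are still waiting to be processed, so the syncing payload keeps retrying until it times out. The following is a hedged sketch of that kind of back-pressure check, not the actual ImmModel.cc code; pending_fevs_in and the trimmed-down error enum are stand-ins introduced for the example.

/* Hedged sketch, not the actual ImmModel.cc logic. It only illustrates the
 * back-pressure check behind the ERR_TRY_AGAIN log line above. The error
 * enum is a cut-down stand-in for the SAF SaAisErrorT codes. */
#include <stdint.h>
#include <stdio.h>

#define IMMSV_DEFAULT_FEVS_MAX_PENDING 16   /* value from immsv_api.h */

typedef enum {
	SKETCH_OK = 1,
	SKETCH_ERR_TRY_AGAIN = 6
} sketch_ais_error_t;

static sketch_ais_error_t sync_search_next(uint32_t pending_fevs_in)
{
	if (pending_fevs_in > IMMSV_DEFAULT_FEVS_MAX_PENDING) {
		/* Reject this iteration step; the syncing node must retry.
		 * If the backlog never drains, the sync stalls as in the
		 * log above. */
		printf("ERR_TRY_AGAIN: Too many pending incoming fevs messages "
		       "(> %d) rejecting sync iteration next request\n",
		       IMMSV_DEFAULT_FEVS_MAX_PENDING);
		return SKETCH_ERR_TRY_AGAIN;
	}
	/* ... otherwise build and return the next batch of sync data ... */
	return SKETCH_OK;
}

int main(void)
{
	sync_search_next(17);   /* a backlog of 17 (> 16) reproduces the message */
	return 0;
}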
I have managed to work around this bug temporarily with the following patch:
+++ b/osaf/libs/common/immsv/include/immsv_api.h Sat Sep 06 08:38:16 2014 +0000
@@ -70,7 +70,7 @@
 /*Max # of outstanding fevs messages towards director.*/
 /*Note max-max is 255. cb->fevs_replies_pending is an uint8_t*/
-#define IMMSV_DEFAULT_FEVS_MAX_PENDING 16
+#define IMMSV_DEFAULT_FEVS_MAX_PENDING 255
 #define IMMSV_MAX_OBJECTS 10000
 #define IMMSV_MAX_ATTRIBUTES 128
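
The workaround only raises the cap from 16 to 255, which is the hard ceiling noted in the comment because cb->fevs_replies_pending is a uint8_t. A small C11 compile-time guard along the lines below (not part of the OpenSAF tree, just a sketch) would make that ceiling explicit if the value is ever tuned again; as stated above, the change is a temporary measure and does not address why the fevs backlog builds up in the first place.

/* Hedged sketch, not in the OpenSAF tree: a compile-time guard documenting
 * the uint8_t ceiling mentioned in the comment above, so a tuned
 * IMMSV_DEFAULT_FEVS_MAX_PENDING cannot silently exceed what
 * cb->fevs_replies_pending can count. */
#include <stdint.h>

#define IMMSV_DEFAULT_FEVS_MAX_PENDING 255   /* patched value */

_Static_assert(IMMSV_DEFAULT_FEVS_MAX_PENDING <= UINT8_MAX,
               "fevs_replies_pending is a uint8_t; the max pending count must fit in it");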
---