I have also tried the following variations:
**Larger MDS buffers**

    export MDS_SOCK_SND_RCV_BUF_SIZE=126976 
    DTM_SOCK_SND_RCV_BUF_SIZE=126976
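
One related knob worth checking (an assumption on my part, since I have not verified exactly how MDS applies these values): if the buffer size is requested via SO_SNDBUF/SO_RCVBUF, the kernel silently caps it at net.core.wmem_max/net.core.rmem_max, so those maxima may need raising as well:

    # Check the kernel caps on socket buffer sizes
    sysctl net.core.rmem_max net.core.wmem_max

    # Raise them if they are below the requested size (illustrative values)
    sysctl -w net.core.rmem_max=262144
    sysctl -w net.core.wmem_max=262144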

**Longer keep-alive settings**
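
By this I mean TCP keep-alive tuning at the OS level; a minimal sketch, assuming the standard Linux keep-alive sysctls (values are illustrative, not the exact ones from my test):

    # Probe idle connections later and more patiently (illustrative values)
    sysctl -w net.ipv4.tcp_keepalive_time=600
    sysctl -w net.ipv4.tcp_keepalive_intvl=60
    sysctl -w net.ipv4.tcp_keepalive_probes=9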

**OpenSAF build 4.5**

**MTU 9000**

    veth4e51  Link encap:Ethernet  HWaddr aa:a6:f0:5f:0f:82  
              UP BROADCAST RUNNING  MTU:9000  Metric:1
    --
    veth76a4  Link encap:Ethernet  HWaddr 9a:ea:07:f4:be:55  
              UP BROADCAST RUNNING  MTU:9000  Metric:1
    --
    vethb5f5  Link encap:Ethernet  HWaddr 22:98:e3:39:32:34  
              UP BROADCAST RUNNING  MTU:9000  Metric:1
    --
    vethb9e3  Link encap:Ethernet  HWaddr d2:ec:18:c4:f9:2d  
              UP BROADCAST RUNNING  MTU:9000  Metric:1
    --
    vethd703  Link encap:Ethernet  HWaddr 3e:a0:49:c0:f0:73  
              UP BROADCAST RUNNING  MTU:9000  Metric:1
    --
    vethf736  Link encap:Ethernet  HWaddr 4e:c4:6e:74:fc:03  
              UP BROADCAST RUNNING  MTU:9000  Metric:1
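
For reference, one way to apply the jumbo MTU on the container veth interfaces (a sketch; repeat per interface):

    # Set MTU 9000 on one of the veth interfaces listed above
    ip link set dev veth4e51 mtu 9000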

Ping during sync between the containers shows a latency of 0.250-0.500 ms.

The result is the same.
I can provoke the problem by cycling start/stop of the 6th OpenSAF instance in
a Linux container:

    while ( true ); do /etc/init.d/opensafd stop && /etc/init.d/opensafd start; done


---

**[tickets:#1072] Sync stop after few payload nodes joining the cluster (TCP)**

**Status:** invalid
**Milestone:** 4.3.3
**Created:** Fri Sep 12, 2014 09:20 PM UTC by Adrian Szwej
**Last Updated:** Mon Sep 15, 2014 09:48 PM UTC
**Owner:** Anders Bjornerstedt

Communication is MDS over TCP. The cluster is 2+3, and the scenario is:
start the SCs; start 1 payload; wait for sync; start a second payload; wait for
sync; start a 3rd payload. The third one fails, or sometimes the fourth.

There is a problem getting more than 2-3 payloads synchronized, because this
scenario consistently triggers a bug.
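
A rough sketch of the sequence (hostnames and the sync wait are placeholders, not the exact commands used):

    # Start the SCs, then add payloads one at a time, waiting for the
    # IMM sync to finish before starting the next payload.
    # Node names and the fixed sleep are illustrative placeholders.
    for node in sc-1 sc-2 pl-3 pl-4 pl-5; do
        ssh "$node" /etc/init.d/opensafd start
        sleep 60   # crude stand-in for "wait for sync to complete"
    done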

The following is triggered in the loading IMMND, causing the joining node to
time out and fail to start up:

    Sep  6  6:58:02.096550 osafimmnd [502:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_SEARCHNEXT (17) from 2020f
    Sep  6  6:58:02.096575 osafimmnd [502:immnd_evt.c:1443] >> immnd_evt_proc_search_next
    Sep  6  6:58:02.096613 osafimmnd [502:immnd_evt.c:1454] T2 SEARCH NEXT, Look for id:1664
    Sep  6  6:58:02.096641 osafimmnd [502:ImmModel.cc:1366] T2 ERR_TRY_AGAIN: Too many pending incoming fevs messages (> 16) rejecting sync iteration next request
    Sep  6  6:58:02.096725 osafimmnd [502:immnd_evt.c:1676] << immnd_evt_proc_search_next
    Sep  6  6:58:03.133230 osafimmnd [502:immnd_proc.c:1980] IN Sync Phase-3: step:540

I have managed to work around this bug temporarily with the following patch:

    +++ b/osaf/libs/common/immsv/include/immsv_api.h        Sat Sep 06 08:38:16 2014 +0000
    @@ -70,7 +70,7 @@

     /*Max # of outstanding fevs messages towards director.*/
     /*Note max-max is 255. cb->fevs_replies_pending is an uint8_t*/
    -#define IMMSV_DEFAULT_FEVS_MAX_PENDING 16
    +#define IMMSV_DEFAULT_FEVS_MAX_PENDING 255

     #define IMMSV_MAX_OBJECTS 10000
     #define IMMSV_MAX_ATTRIBUTES 128


