Instead of blindly changing other configuration parameters, please first try to
find out what the PROBLEM is.

Go back to OpenSAF defaults on all settings, except IMMSV_FEVS_MAX_PENDING,
which you had increased to 255 (the maximum possible).
You said you had "managed to overcome the performance issue temporarily" by
this increase to 255. What does that mean? Do you still get the problem after
some time, or not, with only that change?

How much traffic are you generating? Not counting SYNC traffic here, I mean
YOUR application traffic. Do you have zero traffic? Obviously it is possible
to generate too much traffic on ANY configuration, and then you will end up
with symptoms like the ones you see.

If the problem appears "fixed" by the 255 (maximum) setting, try *reducing*
IMMSV_FEVS_MAX_PENDING by 50%, from 255 (the current maximum possible) down to
128. Test this for some time and see if you have a stable system. If it is
stable, repeat: reduce by 50% again, test again, and so on, until you reach a
level where the problem re-appears. Then double the value back up to the
lowest level that appeared to be stable.

This would solve the problem if the cause is that your setup has more VARIANCE
in latency, more "bursty" traffic, or more chunky scheduling of execution for
the containers/processors/processes/threads. If that is the case, then the
problem is not traffic overload, but that you indeed need some buffers to be
larger so that the extremes of the variance do not cut you off.
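Roughly, the halving procedure could look like this. This is only a sketch: it
assumes your IMMND picks up IMMSV_FEVS_MAX_PENDING from its environment and
that /etc/opensaf/immnd.conf is where that environment is set (adjust for your
installation; if your version does not read the variable, you would have to
rebuild with a changed default instead):

# Sketch: override the fevs flow-control limit via configuration, then soak-test.
# /etc/opensaf/immnd.conf is an assumed location for the IMMND environment.
echo 'export IMMSV_FEVS_MAX_PENDING=128' >> /etc/opensaf/immnd.conf
/etc/init.d/opensafd stop && /etc/init.d/opensafd start
# Soak-test with normal traffic plus the payload restart cycling shown below.
# If stable, halve to 64, 32, 16, ... until the problem re-appears,
# then double back up to the lowest value that was still stable.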
/AndersBj
________________________________
From: Adrian Szwej [mailto:[email protected]]
Sent: 16 September 2014 00:47
To: [email protected]
Subject: [tickets] [opensaf:tickets] #1072 Sync stop after few payload nodes joining the cluster (TCP)
I have also tried the following variants:
Larger MDS buffers:
export MDS_SOCK_SND_RCV_BUF_SIZE=126976
DTM_SOCK_SND_RCV_BUF_SIZE=126976
Longer keep-alive settings
OpenSAF build 4.5
MTU 9000
veth4e51 Link encap:Ethernet HWaddr aa:a6:f0:5f:0f:82
UP BROADCAST RUNNING MTU:9000 Metric:1
--
veth76a4 Link encap:Ethernet HWaddr 9a:ea:07:f4:be:55
UP BROADCAST RUNNING MTU:9000 Metric:1
--
vethb5f5 Link encap:Ethernet HWaddr 22:98:e3:39:32:34
UP BROADCAST RUNNING MTU:9000 Metric:1
--
vethb9e3 Link encap:Ethernet HWaddr d2:ec:18:c4:f9:2d
UP BROADCAST RUNNING MTU:9000 Metric:1
--
vethd703 Link encap:Ethernet HWaddr 3e:a0:49:c0:f0:73
UP BROADCAST RUNNING MTU:9000 Metric:1
--
vethf736 Link encap:Ethernet HWaddr 4e:c4:6e:74:fc:03
UP BROADCAST RUNNING MTU:9000 Metric:1
Ping during sync between containers shows a latency of 0.250-0.500 ms.
The result is the same.
I can provoke the problem by cycling start/stop of the 6th OpenSAF instance in
a Linux container:
while ( true ); do /etc/init.d/opensafd stop && /etc/init.d/opensafd start; done
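(A general Linux note on the larger socket buffer values tried above, not
something from this ticket: setsockopt() silently caps SO_SNDBUF/SO_RCVBUF
requests at the net.core.wmem_max/net.core.rmem_max sysctls, so if the MDS/DTM
settings map to those socket options, a quick check that the 126976-byte
request actually takes effect could look like this:)

# Check the kernel caps that bound SO_RCVBUF/SO_SNDBUF requests.
sysctl net.core.rmem_max net.core.wmem_max
# Raise them if they are below the requested buffer size (example values only).
sysctl -w net.core.rmem_max=262144
sysctl -w net.core.wmem_max=262144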
________________________________
[tickets:#1072]<http://sourceforge.net/p/opensaf/tickets/1072> Sync stop after few payload nodes joining the cluster (TCP)
Status: invalid
Milestone: 4.3.3
Created: Fri Sep 12, 2014 09:20 PM UTC by Adrian Szwej
Last Updated: Mon Sep 15, 2014 09:48 PM UTC
Owner: Anders Bjornerstedt
Communication is MDS over TCP. The cluster is 2+3, and the scenario is:
start the SCs; start the 1st payload; wait for sync; start the 2nd payload;
wait for sync; start the 3rd payload. The third one fails, or sometimes it is
the fourth.
There is a problem getting more than 2-3 payloads synchronized, because a bug
is triggered in a consistent way.
The following is triggered in the loading IMMND, causing the joining node to
time out and fail to start up:
Sep 6 6:58:02.096550 osafimmnd [502:immsv_evt.c:5382] T8 Received: IMMND_EVT_A2ND_SEARCHNEXT (17) from 2020f
Sep 6 6:58:02.096575 osafimmnd [502:immnd_evt.c:1443] >> immnd_evt_proc_search_next
Sep 6 6:58:02.096613 osafimmnd [502:immnd_evt.c:1454] T2 SEARCH NEXT, Look for id:1664
Sep 6 6:58:02.096641 osafimmnd [502:ImmModel.cc:1366] T2 ERR_TRY_AGAIN: Too many pending incoming fevs messages (> 16) rejecting sync iteration next request
Sep 6 6:58:02.096725 osafimmnd [502:immnd_evt.c:1676] << immnd_evt_proc_search_next
Sep 6 6:58:03.133230 osafimmnd [502:immnd_proc.c:1980] IN Sync Phase-3: step:540
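How often this flow control kicks in during a sync can be estimated with
something like the following (only a sketch; it assumes the IMMND traces are
collected in /var/log/messages, so adjust the path to wherever your traces
actually go):

# Count the fevs flow-control rejections logged during the failed sync.
grep -c 'Too many pending incoming fevs messages' /var/log/messages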
I have managed to overcome this bug temporarily by making the following patch:
+++ b/osaf/libs/common/immsv/include/immsv_api.h Sat Sep 06 08:38:16 2014 +0000
@@ -70,7 +70,7 @@
/*Max # of outstanding fevs messages towards director.*/
/*Note max-max is 255. cb->fevs_replies_pending is an uint8_t*/
-#define IMMSV_DEFAULT_FEVS_MAX_PENDING 16
+#define IMMSV_DEFAULT_FEVS_MAX_PENDING 255
#define IMMSV_MAX_OBJECTS 10000
#define IMMSV_MAX_ATTRIBUTES 128