Re: [mcollective-users] nats-pure and kernel tuning for gnatsd

Christopher Wood Tue, 04 Apr 2017 11:07:43 -0700

On Tue, Apr 04, 2017 at 07:10:56PM +0200, R.I.Pienaar wrote:
> 
> 
> On Tue, Apr 4, 2017, at 18:41, Christopher Wood wrote:
> > (Writing this out here for posterity and people seeing similar items.)
> > 
> > A little while ago I erroneously thought that gnatsd might use openssl
> > and thus had gnatsd tagged to restart on openssl package update via
> > puppet. (Found https://golang.org/pkg/crypto/tls/, untagged the gnatsd
> > service.)
> > 
> > While gnatsd itself was fine after the restart, the server was not happy
> > with ~1.9k mcollectived reconnecting at once.
> > 
> > Mar 21 12:24:48 mcomq2 kernel: possible SYN flooding on port 4242.
> > Sending cookies.
> > 
> > The affected mcollectived were logging this and not retrying:
> > 
> > W, [2017-03-21T10:55:43.213823 #9006]  WARN -- : natswrapper.rb:117:in
> > `block (3 levels) in start' Disconnected from NATS: Client disconnected
> > from server on nats://mcomq2.me.com:4242
> > 
> > The solution was two-part:
> > 
> > 1) Upgrade choria to be able to update from eventmachine+nats gems to
> > nats-pure 0.2.2.
> > 
> > https://github.com/nats-io/pure-ruby-nats
> > https://github.com/choria-io/mcollective-choria
> > 
> > 2) Add some sysctls on the mcomq host to accommodate the initial rush of
> > connections.
> > 
> > sysctl { 'net.core.somaxconn': value => '4092' }
> > sysctl { 'net.ipv4.tcp_max_syn_backlog': value => '8192' }
> > 
> > https://forge.puppet.com/thias/sysctl
> > 
> > After that it has been back to smooth sailing.
> > 
> 
> Nice! I'll add a note to the Choria docs to this effect.
> 
> I did also consider making the :reconnect_time_wait option a random
> value between 0 and 5 seconds to spread the reconnects; right now it's
> set to 1.
> 
> Do you think that would have been a good choice given your
> experience?


For my specific case at the current time, if I restarted gnatsd with a random 
0s-5s wait I would have an average of about 400 hosts connecting each second. 
I'm no expert on kernel networking, but I don't think that would avoid the 
issue with the default net.core.somaxconn. I could federate the broker, but 
that would really just put off tuning the kernel until later, and load is 
still really low on the mcomq host anyway. I also want things to reconnect 
promptly.

net.core.somaxconn = 128
net.ipv4.tcp_max_syn_backlog = 2048

$ mco find --dt 30 | wc -l
2028
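
As a back-of-the-envelope check of those numbers, here is a rough Ruby sketch. 
The helper name is hypothetical, it assumes every host picks a uniformly 
random delay, and comparing a per-second rate directly against somaxconn is a 
simplification (the accept queue drains continuously), so treat it as an 
illustration rather than a model of the kernel.

```ruby
# Rough estimate: if every host picks a uniformly random reconnect delay
# within window_seconds, how many connections arrive per second on average?
# Hypothetical helper, not part of mcollective or choria.
def average_reconnects_per_second(hosts, window_seconds)
  raise ArgumentError, 'window must be positive' unless window_seconds.positive?

  hosts.to_f / window_seconds
end

hosts = 2028               # from `mco find --dt 30 | wc -l`
default_somaxconn = 128    # value on the mcomq host before tuning
tuned_somaxconn = 4092     # value set through puppet

rate = average_reconnects_per_second(hosts, 5)
puts format('%.1f reconnects/s over a 5s window', rate)
puts "above default somaxconn: #{rate > default_somaxconn}"
puts "above tuned somaxconn:   #{rate > tuned_somaxconn}"
```

With these inputs the average rate is well above the untuned value and well 
below the tuned one, which matches what I saw.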

This is not the biggest collective but seems larger than my mental "ok to run 
without tuning" threshold.

A random wait between 0 and a configurable maximum, with a default maximum of 
5 seconds, sounds good for a common starter setup, as long as mcollectived 
logs the obvious. 

I, [datething]  INFO -- : mcollectived:123:in '<thing>' waiting 7 seconds to 
connect as configured by 
plugin.choria.reconnect_time_random_range_max__needs_better_name
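
A minimal sketch of what such a jittered wait could look like in Ruby. The 
method names, the max_wait parameter and the message wording are all 
hypothetical placeholders, not the choria or nats-pure API; the sketch only 
shows picking a random delay and producing a log message for it.

```ruby
# Hypothetical sketch: pick a random reconnect delay between 0 and max_wait
# seconds. Not an existing choria or nats-pure method.
def jittered_reconnect_wait(max_wait, rng: Random.new)
  raise ArgumentError, 'max_wait must be >= 0' if max_wait.negative?

  rng.rand(0.0..max_wait.to_f)   # uniform float from the inclusive range
end

# Hypothetical log message so the operator can see why the daemon is idle.
def reconnect_wait_message(wait, max_wait)
  format('waiting %.2f seconds to connect (random, configured maximum %s)', wait, max_wait)
end

wait = jittered_reconnect_wait(5)
puts reconnect_wait_message(wait, 5)
# A real daemon would call sleep(wait) here before reconnecting.
```

Passing a seeded Random as rng makes the delay reproducible for testing.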

The catch is that the maximum number of mcollectived instances that can 
connect at once is set at the kernel level on the gnatsd host, so the 
collective owner will need to do the scaling work at some point anyway. 
If the collective's average reconnect rate stays below net.core.somaxconn 
per second, the config default might work. (Again, I'm not great at kernel 
networking, so my assumptions may be quite off.)
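
To put rough numbers on that rule of thumb, a small Ruby sketch follows. Both 
helpers are hypothetical, and the rule itself (average reconnects per second 
at or below net.core.somaxconn) is only my working assumption, not a 
statement about how the kernel actually queues connections.

```ruby
# Hypothetical helpers for the rule of thumb: keep the average reconnect
# rate (hosts / window) at or below net.core.somaxconn.

# Largest collective a given spread window would cover.
def max_hosts_for_window(somaxconn, window_seconds)
  somaxconn * window_seconds
end

# Smallest whole-second spread window a given collective would need.
def min_window_for_hosts(hosts, somaxconn)
  (hosts.to_f / somaxconn).ceil
end

puts max_hosts_for_window(128, 5)     # hosts covered by a 0-5s spread at somaxconn 128
puts min_window_for_hosts(2028, 128)  # seconds of spread 2028 hosts would need at somaxconn 128
```

By that estimate a 5 second window at the untuned value covers a few hundred 
hosts, and a collective my size would need a much longer window than I would 
like for prompt reconnects.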


> -- 
> 
> --- 
> You received this message because you are subscribed to the Google Groups 
> "mcollective-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
