On Tue, Apr 04, 2017 at 07:10:56PM +0200, R.I.Pienaar wrote:
>
>
> On Tue, Apr 4, 2017, at 18:41, Christopher Wood wrote:
> > (Writing this out here for posterity and people seeing similar items.)
> >
> > A little while ago I erroneously thought that gnatsd might use openssl
> > and thus had gnatsd tagged to restart on openssl package update via
> > puppet. (Found https://golang.org/pkg/crypto/ssl/, untagged the gnatsd
> > service.)
> >
> > While gnatsd itself was fine after the restart, the server was not happy
> > with ~1.9k mcollectived reconnecting at once.
> >
> > Mar 21 12:24:48 mcomq2 kernel: possible SYN flooding on port 4242.
> > Sending cookies.
> >
> > The affected mcollectived were logging this and not retrying:
> >
> > W, [2017-03-21T10:55:43.213823 #9006] WARN -- : natswrapper.rb:117:in
> > `block (3 levels) in start' Disconnected from NATS: Client disconnected
> > from server on nats://mcomq2.me.com:4242
> >
> > The solution was two-part:
> >
> > 1) Upgrade choria to be able to update from eventmachine+nats gems to
> > nats-pure 0.2.2.
> >
> > https://github.com/nats-io/pure-ruby-nats
> > https://github.com/choria-io/mcollective-choria
> >
> > 2) Add some sysctls on the mcomq host to accommodate the initial rush of
> > connections.
> >
> > sysctl { 'net.core.somaxconn': value => '4092' }
> > sysctl { 'net.ipv4.tcp_max_syn_backlog': value => '8192' }
> >
> > https://forge.puppet.com/thias/sysctl
> >
> > After that it has been back to smooth sailing.
> >
>
> Nice!, I'll add a note to the Choria docs to this effect.
>
> I did also consider making the :reconnect_time_wait option be some
> random between 0 and 5 to spread the reconnects, right now it's set to 1.
>
> Do you think that would have been a good choice given your
> experience?
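
As a rough check on the random wait idea quoted above, the sketch below spreads reconnects uniformly over a 0-5 second window and counts how many land in each second. The host count (2028) and the default net.core.somaxconn (128) are the figures from this thread. The function name, the fixed seed, and the uniform spread are assumptions for illustration only, not how mcollectived or nats-pure actually schedule reconnects. Comparing a per-second rate against somaxconn is also only a rough heuristic, since somaxconn caps the listen backlog and is not a connection rate.

```python
import random
from collections import Counter

HOSTS = 2028              # collective size reported by `mco find` in this thread
MAX_WAIT_S = 5            # proposed upper bound for the random reconnect wait
SOMAXCONN_DEFAULT = 128   # net.core.somaxconn before tuning, per this thread


def simulate_reconnects(hosts, max_wait_s, seed=0):
    """Hypothetical helper: count reconnects per whole second.

    Each host is assumed to pick its wait uniformly at random, so we only
    track which whole-second bucket (0 .. max_wait_s - 1) it falls into.
    """
    rng = random.Random(seed)
    per_second = Counter(rng.randrange(max_wait_s) for _ in range(hosts))
    return [per_second[second] for second in range(max_wait_s)]


buckets = simulate_reconnects(HOSTS, MAX_WAIT_S)
average = HOSTS / MAX_WAIT_S

print("reconnects per second:", buckets)
print("average per second:", average)
print("average exceeds default somaxconn:", average > SOMAXCONN_DEFAULT)
```

With these inputs the average is 2028 / 5, a little over 400 reconnects per second, which is well above 128 regardless of how the individual seconds fall.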

For my specific case at the current time, if I restarted gnatsd with a 0s-5s wait I would have an average of about 400 hosts connecting each second (2028 hosts spread over 5 seconds). I'm not the expert on kernel networking, but I don't think I'd avoid the issue with the default net.core.somaxconn. I could federate the broker, but that would really just put off tuning the kernel until later, and load is still really low on the mcomq host anyway. I also want things to reconnect promptly.

net.core.somaxconn = 128
net.ipv4.tcp_max_syn_backlog = 2048

$ mco find --dt 30 | wc -l
2028

This is not the biggest collective, but it seems larger than my mental "ok to run without tuning" threshold.

A random wait between 0 and a configurable maximum, with a default of 5, sounds good for a common starter setup case, as long as mcollectived logs the obvious:

I, [datething] INFO -- : mcollectived:123:in '<thing>' waiting 7 seconds to connect as configured by plugin.choria.reconnect_time_random_range_max__needs_better_name

The catch is that the implied maximum number of mcollectived instances that can connect at once is defined at the kernel level on the gnatsd host, so the collective owner is going to need to do the scaling work at some point anyway. If the average collective size, spread over the wait window, comes out below net.core.somaxconn connections per second, the config default might work. (Again, not great at kernel networking, so my assumptions may be quite off.)

--
---
You received this message because you are subscribed to the Google Groups "mcollective-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.
