(Top-posting, in brain-dump form.)

I have been there, albeit with half the number of running mcollective daemons 
and 2 leaf nodes plus a hub. (You can also read my past posts to this list and 
to activemq-user for a flavour of how I ended up here. I have to add the 
disclaimer that my skill is at a certain level and I was never able to fully 
surmount these difficulties myself. If something here looks suboptimal to you, 
it probably is.)

Gist of our final activemq template, likely revelatory concerning my level of 
activemq expertise (or lack thereof): 
https://gist.github.com/christopherwood/90752942b3ce3fe9010a7fb1e7d078ba

Advice bits follow:

1) Reduce the number of subcollectives as far as you can: each additional 
subcollective is another set of queues/topics, plus the ActiveMQ capacity 
required to track all of their subscribers. It seemed to me that ActiveMQ 
choked on having many small things rather than fewer, larger things. (See the 
first sketch after this list.)

2) Make sure you're using nio in your transportConnector stanzas (second 
sketch after this list). 
http://activemq.apache.org/nio-transport-reference.html

3) Tune your networkConnector stanzas (third sketch after this list).

4) I found that turning prefetch off entirely on the networkConnector uri 
improved matters, though I recall it broke some specific mcollective feature 
(one I never really used, so I didn't miss it); I can't recall just what 
right now.

5) I (accidentally) found out that with a clustered ActiveMQ I could restart 
just the hub node and clustering would recover in about 90% of cases. For the 
rest I had to start the hub first, then each leaf node separately with some 
delay between starting the daemons. At least only having to restart the hub 
shortened the outages.
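
To illustrate 1), here's a minimal sketch of what each subcollective costs in 
activemq.xml, following the stock puppetlabs authorization layout (the 
uk_collective name is made up):

    <authorizationEntries>
      <!-- each collective needs its own pair of entries, and ActiveMQ
           tracks a separate set of destinations for each one:
           COLLECTIVE.*.agent topics plus the COLLECTIVE.nodes and
           COLLECTIVE.reply.> queues -->
      <authorizationEntry topic="mcollective.>" write="mcollective" read="mcollective" admin="mcollective" />
      <authorizationEntry queue="mcollective.>" write="mcollective" read="mcollective" admin="mcollective" />
      <!-- ...repeated for every subcollective you add -->
      <authorizationEntry topic="uk_collective.>" write="mcollective" read="mcollective" admin="mcollective" />
      <authorizationEntry queue="uk_collective.>" write="mcollective" read="mcollective" admin="mcollective" />
    </authorizationEntries>

Fourteen subcollectives means fourteen sets of destinations and subscriptions 
for every broker in the cluster to track.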
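
To illustrate 2), the transportConnector shape I mean (stock ports; adjust to 
your own config):

    <transportConnectors>
      <!-- stomp+nio uses a shared NIO thread pool rather than a thread
           per connection, which matters with thousands of connected
           mcollective daemons -->
      <transportConnector name="stomp+nio" uri="stomp+nio://0.0.0.0:61613"/>
      <!-- same idea for the interbroker openwire listener -->
      <transportConnector name="openwire+nio" uri="nio://0.0.0.0:61616"/>
    </transportConnectors>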
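
To illustrate 3) and 4), a hedged sketch of one leaf-to-hub networkConnector; 
the hostname and attribute values here are illustrative, not gospel. Note that 
ActiveMQ won't accept prefetchSize="0" on a network connector, so "off" in 
practice means as low as it will go:

    <networkConnector
        name="leaf1-to-hub-topics"
        uri="static:(nio://hub.example.com:61616)"
        duplex="true"
        decreaseNetworkConsumerPriority="true"
        networkTTL="2"
        dynamicOnly="true"
        prefetchSize="1">
      <!-- the puppetlabs multi-broker example splits topics and queues
           into separate connectors; this one carries topics only -->
      <excludedDestinations>
        <queue physicalName=">" />
      </excludedDestinations>
    </networkConnector>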




For other people considering using nats:

As for upgrading to nats.io: you can trivially package the nats daemon and gem 
using fpm (https://github.com/jordansissel/fpm), which takes a lot of the 
difficulty away. I packaged the nats-pure gem to live under /opt/puppetlabs so 
I only needed one Ruby per server; ymmv.

a) put the gnatsd binary in /usr/sbin
b) fpm -s dir -t rpm -n "gnatsd" -v 0.9.4 --iteration 2 /usr/sbin/gnatsd
c) install the fpm gem under /opt/puppetlabs
d) PATH=/opt/puppetlabs/puppet/bin:$PATH /opt/puppetlabs/puppet/bin/fpm -s gem --verbose -t rpm nats-pure
e) retrieve the rpms, then rebuild this scratch server now that you've done 
manual things to it


On Wed, May 17, 2017 at 02:03:17AM -0700, Steven Meunier wrote:
>    Hello all
> 
>    We're using MCollective with ActiveMQ but are running into problems with
>    the stability of the ActiveMQ cluster. The basic problem is a daily issue
>    where the mcollective client is unable to find or communicate with any of
>    the servers. The problem is most clearly seen when either ActiveMQ is
>    restarted or the MCollective servers are restarted en masse, as with the
>    mcollectived restart that is part of the logrotate config. Performing
>    "mco inventory --list-collectives" will result in 0 hosts found. The only
>    way to fix it seems to be to stop all ActiveMQ instances and then start
>    them one by one.
> 
>    We are busy preparing to migrate to Choria and Nats but if you have any
>    advice that would allow us to stabilise the cluster enough to buy us the
>    time we need to migrate, that would be greatly appreciated.
> 
>    Our setup is as follows: we have a central hub server with 4 leaf nodes.
>    The configuration for ActiveMQ has been taken from
>    
> https://github.com/puppetlabs/marionette-collective/tree/master/ext/activemq/examples/multi-broker.
>    The only changes we've made to that config is for the authorization
>    entries for our users and subcollectives. We've also changed the
>    systemUsage settings increasing memoryUsage to 512mb, storeUsage to 4gb
>    and tempUsage to 512mb.
> 
>    We upgraded ActiveMQ from 5.8 to 5.14.5 last week in the hopes of
>    increasing the stability but this has not helped. We are running puppet 4
>    with the puppet-agent 1.10.0 package.
> 
>    We have the following sysctl settings, which were adapted from the
>    "Learning MCollective" book (if memory serves; they were added a long
>    time ago):
>    net.ipv4.tcp_fin_timeout: 15
>    net.ipv4.tcp_tw_reuse: 1
>    net.ipv4.tcp_keepalive_time: 300
>    net.ipv4.tcp_keepalive_intvl: 30
>    net.ipv4.tcp_keepalive_probes: 5
>    net.core.somaxconn: 2000
>    net.core.rmem_default: 256960
>    net.core.wmem_default: 256960
>    net.ipv4.tcp_timestamps: 0
> 
>    We have about 3100 MCollective servers in total. 700 or so connect to each
>    of the leaf nodes. We also have 14 subcollectives in total matching the
>    environments and datacenters that we have.
> 
>    When we have this issue the MCollective logs indicate that they are
>    connected to ActiveMQ and the ActiveMQ leaf nodes are connected to the
>    central hub.
> 
>    Do you have any advice that might help?
