On 05/16/2014 10:57 AM, Bartosz Kupidura wrote:
> Hello guys!
> I would like to suggest a few changes to Fuel HA/scalability features.
>
> 1. [HA] Ensure public/management VIP is running on node where HAproxy is
> working.
>
> Now if HAproxy dies, VIP is not moved to another node in a cluster.
> Simple way to check this is (HAProxy can die after segfault, wrong config,
> uninstalled package...):
> # echo deadbeef >> /etc/haproxy/haproxy.cfg
> # /etc/init.d/haproxy stop
>
> What happens:
> - Corosync can not start HAproxy
> - Corosync will NOT move VIP to another node
> - ALL connections to VIPs got 'connection refused'
>
> What should happen:
> - Corosync can not start HAproxy
> - Corosync will move VIP to another node
>
> Gerrit change: http://gerrit.vm.mirantis.net:8080/#/c/15617/

Hello.
Thank you for such great feedback.
It would be nice to file LP bugs for this patch, as well as for all the
patches below, and to submit them as public OpenStack Gerrit changes.
Please don't hesitate to submit them; I can also help you with that.
Since the issues discussed here relate not only to Fuel HA but also to
oslo.messaging, RabbitMQ and Nova configuration, I've added the
openstack-dev tag as well. Perhaps some of these changes could also be
contributed to Nova and Oslo.

>
> Now ocf:mirantis:haproxy check only if haproxy is running, in future we can
> implement more sophisticated health checks (backend timeouts, current
> connections limit...)
>
> 2. [HA] Tune TCP keepalive sysctl.
>
> Now we use default ubuntu/centos value (7200+9*75).
> This mean kernel will notice ‘silent’ (not RST, not FIN) connection failure
> after >2h.

Yes, the defaults are (always) poor :-)
Here is a list of existing issues (and patches, if any were submitted
already):
https://etherpad.openstack.org/p/fuel-ha-rabbitmq
That document is also a kind of brainstorm, so feel free to participate.
Personally, I like the ideas of putting RabbitMQ cluster management under
Pacemaker and of considering a cluster of only two rabbits in order to
address
https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/22,
if such strange things are indeed happening with cluster size 3+.

>
> From my experience good value for HA systems is 180s:
> net.ipv4.tcp_keepalive_time = 120
> net.ipv4.tcp_keepalive_probes = 3
> net.ipv4.tcp_keepalive_intvl = 20
>
> Gerrit change: http://gerrit.vm.mirantis.net:8080/#/c/15618/

Looks like your choice is better. There is a patch
https://review.openstack.org/#/c/93815/ related to
https://bugs.launchpad.net/oslo.messaging/+bug/856764/comments/19 as well.
We could discuss exactly which TCP keepalive parameters fit better.
One more related patch, for the RabbitMQ cluster:
https://review.openstack.org/#/c/93411/

>
> 3. [Scalability] shuffle amqp nodes in Openstack configs.
>
> Now each Openstack node (compute, cinder, ...) connect to #1 controller,
> after failure it reconnects to #2, after that to #3 controller.
>
> In this case, ALL AMQP traffic is served by #1.
>
> We can shuffle 'rabbit_hosts' on each node.
>
> Gerrit change: http://gerrit.vm.mirantis.net:8080/#/c/15619/

That is a brilliant idea. I was investigating related things recently and
found this thread:
http://rabbitmq.1065348.n5.nabble.com/Correct-way-of-determining-which-node-is-master-td91.html
According to it, there could be a good performance benefit in spreading
the queue masters around.
But actually, we already have this in the recent amqp-nodes patches
accepted for Fuel 5.0. Let me elaborate. We configure rabbit hosts for all
controllers as:
rabbit_nodes = 127.0.0.1:5673, rabbit1:5673, ... rabbitX:5673
As you can see, the initial connection point for new queues will always be
the node itself, hence all master queues would be automatically "shuffled"
as well. Basically, I see this as the main reason why we should never use
a VIP for the rabbit cluster.

>
>
> Best Regards,
> Bartosz Kupidura
>

--
Best regards,
Bogdan Dobrelya,
Skype #bogdando_at_yahoo.com
Irc #bogdando

--
Mailing list: https://launchpad.net/~fuel-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~fuel-dev
More help   : https://help.launchpad.net/ListHelp
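
P.S. To make the keepalive numbers in item 2 concrete, here is a minimal sketch of the arithmetic behind "7200+9*75" and "180s". It assumes the usual approximation that a silent peer is detected after the idle time plus probes times the probe interval; the function name is only illustrative and is not part of any of the patches.

```python
def keepalive_detection_seconds(time_s, probes, intvl_s):
    """Approximate seconds before the kernel declares a silent idle peer dead.

    time_s  -> net.ipv4.tcp_keepalive_time   (idle time before the first probe)
    probes  -> net.ipv4.tcp_keepalive_probes (unanswered probes before giving up)
    intvl_s -> net.ipv4.tcp_keepalive_intvl  (seconds between probes)
    """
    return time_s + probes * intvl_s


# Distribution defaults quoted in the thread: 7200 + 9 * 75
default_timeout = keepalive_detection_seconds(7200, 9, 75)

# Values proposed in the thread: 120 + 3 * 20
proposed_timeout = keepalive_detection_seconds(120, 3, 20)

print("defaults: %d s (about %.1f h)" % (default_timeout, default_timeout / 3600.0))
print("proposed: %d s" % proposed_timeout)
```

With the defaults this gives 7875 seconds, a bit over two hours, versus 180 seconds for the proposed values.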

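
For item 3, the per-node shuffle of 'rabbit_hosts' could look roughly like the sketch below. This is not the code from the Gerrit change; the function name, the example host names and the idea of seeding the shuffle with the node's hostname (so that repeated deployment runs produce the same order on a given node) are all assumptions made for illustration.

```python
import random


def shuffled_rabbit_hosts(hosts, node_name):
    """Return a comma-separated host list in a per-node pseudo-random order.

    Seeding with node_name (a hypothetical choice) keeps the order stable
    for one node across runs while different nodes get different orders.
    """
    rng = random.Random(node_name)
    shuffled = list(hosts)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return ",".join(shuffled)


# Hypothetical controller names; port 5673 as used in the thread.
controllers = ["rabbit1:5673", "rabbit2:5673", "rabbit3:5673"]

for node in ("compute-1", "compute-2", "cinder-1"):
    print("%s: rabbit_hosts = %s" % (node, shuffled_rabbit_hosts(controllers, node)))
```

Each node still lists every controller, so failover behaviour is unchanged; only the first connection target differs from node to node.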
