I have finally pushed the first version of the RabbitMQ fix to gerrit: https://review.openstack.org/77409
I tried to keep changes to a minimum and do no refactoring, but due to the
high amount of code duplication and inconsistencies in RabbitMQ
configuration for different OpenStack components, the fix turned out more
intrusive than I expected. Please review and test with care.

Please note that the current version of the fix doesn't even fully cover
the scope of part (1) from my plan quoted below:

1a) It doesn't change Neutron configuration; I need Sergey's help with
this. Sergey, you already have a TODO item in sanitize_neutron_config()
that is supposed to do exactly what's needed here: put a list of
controller IPs with port 5673 into neutron_config[amqp][hosts].

1b) It doesn't change Murano configuration. Murano seems to be using its
own implementation of a RabbitMQ-based RPC backend instead of the almost
homogeneous zoo of impl_kombu implementations used by the rest of
OpenStack. I'm not even sure it has the same reconnect mechanism as
impl_kombu. Can anyone from the Murano team comment?

I've also made no progress on parts (2) and (3) today (flush_routes and
read_timeout). If there are people willing to work on this on Sunday in
EU timezones, your help would be most welcome.

Thanks,
-DmitryB

On Sat, Mar 1, 2014 at 2:46 AM, Dmitry Borodaenko <[email protected]> wrote:
> The solution we have consists of 3 parts:
>
> 1) Reconfigure OpenStack services to bypass HAProxy and connect to
> RabbitMQ directly on the controllers. Our testing shows that this
> actually resolves the RabbitMQ side of the problem.
>
> I'm working on a fuel-library patch that will do that, should be
> mostly straight-forward except for working around all the code
> duplication and hardcoded values in different puppet modules. An
> action item for EU timezone that I think would be most helpful is to
> test the proposed configuration (rabbit_hosts=<controller-1 mgmt
> ip>:5672,<controller-2 mgmt ip>:5672, etc.) in as many different
> failover and vip move scenarios as possible.
>
> One more thing I'm considering that would be worth testing is to see
> if it would be even better to point the controller services to
> rabbit_hosts=127.0.0.1:5672, and leave only compute and other
> non-controller nodes with the enumeration of controller management IPs
> in rabbit_hosts.
>
> 2) Enable flush_routes option for management and public VIPs in crm
> configuration, and restart HAProxy via crm when vip moves (including
> after failover). Our testing shows that these two actions reduce the
> probability of services locking up waiting for a read syscall to time
> out on a hung MySQL connection.
>
> 3) Upgrade python-mysqldb to version 1.2.5 as requested in OSCI-1105,
> and modify mysql connection strings to include read_timeout=90 (I'm
> open to suggestions for the timeout value, since it drives the
> duration of possible service outage after failover it should
> definitely be lower than the Linux kernel default of 10 minutes, but
> long enough not to drop connections due to slow SQL queries). This is
> something I can't do without help from OSCI team: we need deb and rpm
> packages built and tested so that we know they're safe to include in
> 4.1, and can test them in combination with the other fixes.
>
> Thanks,
> -DmitryB
>
>
> On Sat, Mar 1, 2014 at 12:47 AM, Mike Scherbakov
> <[email protected]> wrote:
>> Folks,
>> what is the current status on this? I saw a few comments in bug, but
>> wondering about action items European timezone can take on Monday to
>> continue the path.
>>
>> Thanks,
>>
>>
>> On Fri, Feb 28, 2014 at 9:58 PM, Dmitry Borodaenko
>> <[email protected]> wrote:
>>>
>>> Dear all,
>>>
>>> Please make sure that all discussions that occur elsewhere (this ML
>>> thread, chats, etc.) end up reflected in the LaunchPad bug (even if a
>>> theory is discussed and then eliminated, it's useful to have it
>>> mentioned in the bug so that other people don't repeat the same line
>>> of investigation).
>>> I originally emailed fuel-dev@ to only attract
>>> attention to the problem, I did not intend to split the discussion.
>>>
>>> Thanks,
>>>
>>> On Fri, Feb 28, 2014 at 8:35 AM, Matthew Mosesohn
>>> <[email protected]> wrote:
>>> > I started reaching out to our community folks, Dina and Dmitry.
>>> >
>>> > We tried a few variations, but the same result: nova and cinder
>>> > dislike having the AMQP backend shifted from underneath it.
>>> >
>>> > If we remove haproxy and connect directly to RabbitMQ on a virtual IP,
>>> > all nova and cinder services die when we shift the virtual IP to
>>> > another node. Neutron somehow survives and reconnects in about 25
>>> > seconds and picks up where it left off.
>>> >
>>> > For the record, we're running on 2013.2.2 code. Dmitry Mescheryakov
>>> > asked me to provide a diff of what the RPC code is between neutron and
>>> > cinder to maybe determine why Neutron can resume connections, but
>>> > Cinder surely doesn't. Here is this diff:
>>> > http://paste.openstack.org/show/uXyeYUGxMiAhmcGlK8VZ/
>>> >
>>> > For more info:
>>> > Errors we see in Cinder logs:
>>> > http://pastie.org/private/w8iigjzijfczvsw5ddelwq
>>> > Errors we see in Neutron logs:
>>> > http://pastie.org/private/uelxryhbr42jijip0loe2w
>>> >
>>> > In the bug, mentioned earlier in this thread, we have a diagnostic
>>> > snapshot.
>>> >
>>> > We're still digging for leads to fix this HA failover issue.
>>> >
>>> > -Matthew
>>> >
>>> > On Fri, Feb 28, 2014 at 1:12 PM, Vladimir Kuklin <[email protected]>
>>> > wrote:
>>> >> It will not help if you shut down the controller. The problem is that
>>> >> you have hanged AMQP sessions which kombu driver does not look to
>>> >> handle correctly.
>>> >>
>>> >>
>>> >> On Fri, Feb 28, 2014 at 1:09 PM, Bogdan Dobrelya
>>> >> <[email protected]> wrote:
>>> >>>
>>> >>> On 02/28/2014 05:44 AM, Dmitry Borodaenko wrote:
>>> >>> > Team,
>>> >>> >
>>> >>> > Me and Ryan have spent all day investigating
>>> >>> > https://bugs.launchpad.net/fuel/+bug/1285449
>>> >>> >
>>> >>> > What we have found so far confirms that this is a critical bug that
>>> >>> > absolutely must be resolved before 4.1 is released. I have
>>> >>> > documented our findings in the bug comments, someone please take
>>> >>> > over the investigation when you come to the office tomorrow
>>> >>> > morning MSK time.
>>> >>> >
>>> >>> > I have a feeling that once the root cause is found, the fix will be
>>> >>> > low-impact and will involve either change in HAProxy configuration
>>> >>> > for RabbitMQ, a patch/upgrade of HAProxy or kombu, or something
>>> >>> > similar. But first we need to understand what exactly breaks, and
>>> >>> > why this only affects some services and not all of them.
>>> >>> >
>>> >>> > Thanks,
>>> >>> >
>>> >>>
>>> >>> Here is recent rabbitMQ discussion quote from the
>>> >>> Fuel-conductors-support team skype chat (RU + translation):
>>> >>>
>>> >>> Wednesday, February 26, 2014
>>> >>> [4:00:10 PM] Maxim Yefimov: Коллеги, вопрос есть:
>>> >>> (I have a question)
>>> >>>
>>> >>> listen rabbitmq-openstack
>>> >>>   bind 192.168.0.2:5672
>>> >>>   balance roundrobin
>>> >>>
>>> >>>   server node-1 192.168.0.3:5673 check inter 5000 rise 2 fall 3
>>> >>>   server node-2 192.168.0.4:5673 check inter 5000 rise 2 fall 3 backup
>>> >>>   server node-3 192.168.0.5:5673 check inter 5000 rise 2 fall 3 backup
>>> >>>
>>> >>> [4:01:01 PM] Maxim Yefimov: Зачем одновременно roundrobin и
>>> >>> active-passive?
>>> >>> (Why do we use roundrobin and active-passive at once for RabbitMQ?)
>>> >>>
>>> >>> [4:01:39 PM] Miroslav Anashkin: Чтобы коннект не рвался
>>> >>> (To make sure the connection wouldn't break)
>>> >>>
>>> >>> [4:02:01 PM] Miroslav Anashkin: У кролика кластер существует строго в
>>> >>> виде мастер-слейв
>>> >>> (RabbitMQ clustering is restricted to master-slave only)
>>> >>>
>>> >>> [4:02:23 PM] Miroslav Anashkin: Соответственно даже если какая-то нода
>>> >>> с запросом к слейву придет - та его на мастер отправит
>>> >>> (Hence, any node's query to the RabbitMQ slave would have been re-sent
>>> >>> to the master)
>>> >>>
>>> >>> [4:02:52 PM] Miroslav Anashkin: Поэтому сделали так чтобы ХАПрокси
>>> >>> всегда всех посылал на одну ноду
>>> >>> (Thats why HAproxy always redirects all queries to the single RabbitMQ
>>> >>> node)
>>> >>>
>>> >>> And I'm not clear with this explanation, honestly. Why couldn't we
>>> >>> make OS establish direct connections to arbitrary (LB) chosen
>>> >>> RabbitMQ nodes skipping HAproxy at all? (because of this: "any
>>> >>> node's query to the RabbitMQ slave would have been re-sent to the
>>> >>> master")
>>> >>>
>>> >>> Could that resolve the issue? I think I will investigate this option
>>> >>> as well.
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Best regards,
>>> >>> Bogdan Dobrelya,
>>> >>> Skype #bogdando_at_yahoo.com
>>> >>> Irc #bogdando
>>> >>>
>>> >>> --
>>> >>> Mailing list: https://launchpad.net/~fuel-dev
>>> >>> Post to : [email protected]
>>> >>> Unsubscribe : https://launchpad.net/~fuel-dev
>>> >>> More help : https://help.launchpad.net/ListHelp
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Yours Faithfully,
>>> >> Vladimir Kuklin,
>>> >> Senior Deployment Engineer,
>>> >> Mirantis, Inc.
>>> >> +7 (495) 640-49-04
>>> >> +7 (926) 702-39-68
>>> >> Skype kuklinvv
>>> >> 45bk3, Vorontsovskaya Str.
>>> >> Moscow, Russia,
>>> >> www.mirantis.com
>>> >> www.mirantis.ru
>>> >> [email protected]
>>> >>
>>> >> --
>>> >> Mailing list: https://launchpad.net/~fuel-dev
>>> >> Post to : [email protected]
>>> >> Unsubscribe : https://launchpad.net/~fuel-dev
>>> >> More help : https://help.launchpad.net/ListHelp
>>> >>
>>> >
>>> > --
>>> > Mailing list: https://launchpad.net/~fuel-dev
>>> > Post to : [email protected]
>>> > Unsubscribe : https://launchpad.net/~fuel-dev
>>> > More help : https://help.launchpad.net/ListHelp
>>>
>>>
>>>
>>> --
>>> Dmitry Borodaenko
>>>
>>> --
>>> Mailing list: https://launchpad.net/~fuel-dev
>>> Post to : [email protected]
>>> Unsubscribe : https://launchpad.net/~fuel-dev
>>> More help : https://help.launchpad.net/ListHelp
>>
>>
>>
>>
>> --
>> Mike Scherbakov
>> #mihgen
>
>
>
> --
> Dmitry Borodaenko

--
Dmitry Borodaenko

--
Mailing list: https://launchpad.net/~fuel-dev
Post to : [email protected]
Unsubscribe : https://launchpad.net/~fuel-dev
More help : https://help.launchpad.net/ListHelp
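
For anyone testing parts (1) and (3) by hand, here is a rough sketch of how the two configuration values discussed in this thread could be assembled. It is only an illustration, not what the fuel-library patch does: the helper names build_rabbit_hosts() and with_read_timeout() are made up, the credentials in the example URL are placeholders, and the default port 5673 is taken from item 1a and the HAProxy backend lines quoted in the thread (the plan text itself uses 5672, so pass whichever port your RabbitMQ actually listens on).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_rabbit_hosts(controller_ips, port=5673):
    """Return a comma-separated host:port list for a rabbit_hosts setting.

    The port default is an assumption based on the backend port shown in
    the HAProxy configuration quoted in the thread.
    """
    return ",".join("%s:%d" % (ip, port) for ip in controller_ips)


def with_read_timeout(connection_url, seconds=90):
    """Add (or overwrite) read_timeout in a connection URL's query string.

    Existing query parameters are kept in their original order.
    """
    parts = urlsplit(connection_url)
    query = dict(parse_qsl(parts.query))
    query["read_timeout"] = str(seconds)
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    # Controller IPs as they appear in the quoted HAProxy config.
    print(build_rabbit_hosts(["192.168.0.3", "192.168.0.4", "192.168.0.5"]))
    # Hypothetical credentials and database name, for illustration only.
    print(with_read_timeout("mysql://nova:secret@192.168.0.2/nova?charset=utf8"))
```

The timeout is kept as a parameter because the value of 90 seconds is still open for discussion above.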

