Mike & Team,

It's quite a few changes, more than I would like to see at the end of 4.1. On the other hand, the three parts Dmitry mentioned have been tested in the lab and appear to solve most of the HA/failover issues, so I'm all for including them in 4.1.
It would be good to know if there is any impact on the release date. Last week we moved it from Friday 2/28 to Tuesday 3/4, which leaves Mon MSK, Mon PT, and Tue MSK to do the job. Since this is not a regression (it doesn't work in 4.0 either), another option would be to leave the fix out and release as is. I don't think that's a good idea, though.

Thanks,
Roman

On Sat, Mar 1, 2014 at 2:46 AM, Dmitry Borodaenko <[email protected]> wrote:
> The solution we have consists of 3 parts:
>
> 1) Reconfigure OpenStack services to bypass HAProxy and connect to
> RabbitMQ directly on the controllers. Our testing shows that this
> actually resolves the RabbitMQ side of the problem.
>
> I'm working on a fuel-library patch that will do that; it should be
> mostly straightforward except for working around all the code
> duplication and hardcoded values in the different puppet modules. An
> action item for the EU timezone that I think would be most helpful is to
> test the proposed configuration (rabbit_hosts=<controller-1 mgmt
> ip>:5672,<controller-2 mgmt ip>:5672, etc.) in as many different
> failover and VIP move scenarios as possible.
>
> One more thing I'm considering that would be worth testing: whether it
> would be even better to point the controller services to
> rabbit_hosts=127.0.0.1:5672, and leave the enumeration of controller
> management IPs in rabbit_hosts only on compute and other
> non-controller nodes.
>
> 2) Enable the flush_routes option for the management and public VIPs in
> the crm configuration, and restart HAProxy via crm when a VIP moves
> (including after failover). Our testing shows that these two actions
> reduce the probability of services locking up while waiting for a read
> syscall to time out on a hung MySQL connection.
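[Editor's note: for concreteness, part 1 might look something like the following on a compute node. This is a hypothetical sketch, not the actual fuel-library patch: the IPs are illustrative, and the rabbit_ha_queues line is an assumption about what mirrored queues would additionally require, not something stated in the thread.]

```ini
# /etc/nova/nova.conf -- hypothetical sketch of bypassing HAProxy for AMQP
[DEFAULT]
# Enumerate the controllers' management IPs directly instead of the VIP
rabbit_hosts = 192.168.0.3:5672,192.168.0.4:5672,192.168.0.5:5672
# Presumably needed so queues are mirrored across the cluster
rabbit_ha_queues = True
```

[On the controllers themselves, the variant Dmitry mentions would instead set rabbit_hosts = 127.0.0.1:5672.]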
>
> 3) Upgrade python-mysqldb to version 1.2.5 as requested in OSCI-1105,
> and modify the mysql connection strings to include read_timeout=90. (I'm
> open to suggestions for the timeout value: since it drives the duration
> of a possible service outage after failover, it should definitely be
> lower than the Linux kernel default of 10 minutes, but long enough not
> to drop connections due to slow SQL queries.) This is something I can't
> do without help from the OSCI team: we need deb and rpm packages built
> and tested so that we know they're safe to include in 4.1, and can test
> them in combination with the other fixes.
>
> Thanks,
> -DmitryB
>
> On Sat, Mar 1, 2014 at 12:47 AM, Mike Scherbakov
> <[email protected]> wrote:
> > Folks,
> > what is the current status on this? I saw a few comments in the bug,
> > but I'm wondering about action items the European timezone can take on
> > Monday to continue down this path.
> >
> > Thanks,
> >
> > On Fri, Feb 28, 2014 at 9:58 PM, Dmitry Borodaenko
> > <[email protected]> wrote:
> >> Dear all,
> >>
> >> Please make sure that all discussions that occur elsewhere (this ML
> >> thread, chats, etc.) end up reflected in the Launchpad bug (even if a
> >> theory is discussed and then eliminated, it's useful to have it
> >> mentioned in the bug so that other people don't repeat the same line
> >> of investigation). I originally emailed fuel-dev@ only to attract
> >> attention to the problem; I did not intend to split the discussion.
> >>
> >> Thanks,
> >>
> >> On Fri, Feb 28, 2014 at 8:35 AM, Matthew Mosesohn
> >> <[email protected]> wrote:
> >> > I started reaching out to our community folks, Dina and Dmitry.
> >> >
> >> > We tried a few variations, but got the same result: nova and cinder
> >> > dislike having the AMQP backend shifted out from underneath them.
> >> >
> >> > If we remove haproxy and connect directly to RabbitMQ on a virtual
> >> > IP, all nova and cinder services die when we shift the virtual IP to
> >> > another node.
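[Editor's note: the failure mode described here, services dying when the broker address shifts, is consistent with a consumer loop that treats a socket error as fatal instead of reconnecting. The sketch below is not the actual oslo/kombu code, just a minimal illustration of the retry behavior a service would need to survive a broker move; all names in it are hypothetical.]

```python
def consume(connect, handle, max_retries=3):
    """Consume messages, reconnecting after connection failures.

    connect() returns a connection whose recv() yields messages and
    raises ConnectionError when the broker goes away; a None message
    signals a clean shutdown. Gives up after max_retries consecutive
    failed connection attempts.
    """
    failures = 0
    while failures <= max_retries:
        try:
            conn = connect()
            failures = 0  # a successful (re)connect resets the retry budget
            while True:
                msg = conn.recv()
                if msg is None:  # sentinel: broker asked for clean shutdown
                    return
                handle(msg)
        except ConnectionError:
            failures += 1  # broker moved or died: loop and reconnect
    raise RuntimeError("gave up reconnecting to the broker")
```

[A service whose dispatch loop lets the ConnectionError propagate instead would die exactly as nova and cinder do above.]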
Neutron somehow survives,
> >> > reconnects in about 25 seconds, and picks up where it left off.
> >> >
> >> > For the record, we're running on 2013.2.2 code. Dmitry Mescheryakov
> >> > asked me to provide a diff of the RPC code between neutron and
> >> > cinder, to maybe determine why Neutron can resume connections but
> >> > Cinder surely can't. Here is that diff:
> >> > http://paste.openstack.org/show/uXyeYUGxMiAhmcGlK8VZ/
> >> >
> >> > For more info:
> >> > Errors we see in Cinder logs:
> >> > http://pastie.org/private/w8iigjzijfczvsw5ddelwq
> >> > Errors we see in Neutron logs:
> >> > http://pastie.org/private/uelxryhbr42jijip0loe2w
> >> >
> >> > There is a diagnostic snapshot in the bug mentioned earlier in this
> >> > thread.
> >> >
> >> > We're still digging for leads to fix this HA failover issue.
> >> >
> >> > -Matthew
> >> >
> >> > On Fri, Feb 28, 2014 at 1:12 PM, Vladimir Kuklin
> >> > <[email protected]> wrote:
> >> >> It will not help if you shut down the controller. The problem is
> >> >> that you have hung AMQP sessions, which the kombu driver does not
> >> >> seem to handle correctly.
> >> >>
> >> >> On Fri, Feb 28, 2014 at 1:09 PM, Bogdan Dobrelya
> >> >> <[email protected]> wrote:
> >> >>>
> >> >>> On 02/28/2014 05:44 AM, Dmitry Borodaenko wrote:
> >> >>> > Team,
> >> >>> >
> >> >>> > Ryan and I have spent all day investigating
> >> >>> > https://bugs.launchpad.net/fuel/+bug/1285449
> >> >>> >
> >> >>> > What we have found so far confirms that this is a critical bug
> >> >>> > that absolutely must be resolved before 4.1 is released. I have
> >> >>> > documented our findings in the bug comments; someone please take
> >> >>> > over the investigation when you come to the office tomorrow
> >> >>> > morning MSK time.
> >> >>> >
> >> >>> > I have a feeling that once the root cause is found, the fix will
> >> >>> > be low-impact and will involve either a change in the HAProxy
> >> >>> > configuration for RabbitMQ, a patch/upgrade of HAProxy or kombu,
> >> >>> > or something similar. But first we need to understand what
> >> >>> > exactly breaks, and why this only affects some services and not
> >> >>> > all of them.
> >> >>> >
> >> >>> > Thanks,
> >> >>> >
> >> >>>
> >> >>> Here is a recent RabbitMQ discussion quote from the
> >> >>> Fuel-conductors-support team Skype chat (translated from Russian):
> >> >>>
> >> >>> Wednesday, February 26, 2014
> >> >>> [4:00:10 PM] Maxim Yefimov: Colleagues, I have a question:
> >> >>>
> >> >>> listen rabbitmq-openstack
> >> >>>   bind 192.168.0.2:5672
> >> >>>   balance roundrobin
> >> >>>   server node-1 192.168.0.3:5673 check inter 5000 rise 2 fall 3
> >> >>>   server node-2 192.168.0.4:5673 check inter 5000 rise 2 fall 3 backup
> >> >>>   server node-3 192.168.0.5:5673 check inter 5000 rise 2 fall 3 backup
> >> >>>
> >> >>> [4:01:01 PM] Maxim Yefimov: Why do we use roundrobin and
> >> >>> active-passive at once for RabbitMQ?
> >> >>>
> >> >>> [4:01:39 PM] Miroslav Anashkin: To make sure the connection
> >> >>> doesn't break.
> >> >>>
> >> >>> [4:02:01 PM] Miroslav Anashkin: A RabbitMQ cluster is strictly
> >> >>> master-slave.
> >> >>>
> >> >>> [4:02:23 PM] Miroslav Anashkin: Hence, even if a node sends a
> >> >>> query to the slave, the slave re-sends it to the master.
> >> >>>
> >> >>> [4:02:52 PM] Miroslav Anashkin: That's why we set up HAProxy to
> >> >>> always send everyone to a single RabbitMQ node.
> >> >>>
> >> >>> Honestly, I'm not convinced by this explanation. Why couldn't we
> >> >>> have OpenStack establish direct connections to arbitrary
> >> >>> (load-balancer-chosen) RabbitMQ nodes, skipping HAProxy entirely?
> >> >>> (Precisely because of this: "any node's query to the RabbitMQ
> >> >>> slave would be re-sent to the master".)
> >> >>>
> >> >>> Could that resolve the issue? I think I will investigate this
> >> >>> option as well.
> >> >>>
> >> >>> --
> >> >>> Best regards,
> >> >>> Bogdan Dobrelya,
> >> >>> Skype #bogdando_at_yahoo.com
> >> >>> Irc #bogdando
> >> >>>
> >> >>> --
> >> >>> Mailing list: https://launchpad.net/~fuel-dev
> >> >>> Post to : [email protected]
> >> >>> Unsubscribe : https://launchpad.net/~fuel-dev
> >> >>> More help : https://help.launchpad.net/ListHelp
> >> >>
> >> >> --
> >> >> Yours Faithfully,
> >> >> Vladimir Kuklin,
> >> >> Senior Deployment Engineer,
> >> >> Mirantis, Inc.
> >> >> +7 (495) 640-49-04
> >> >> +7 (926) 702-39-68
> >> >> Skype kuklinvv
> >> >> 45bk3, Vorontsovskaya Str.
> >> >> Moscow, Russia,
> >> >> www.mirantis.com
> >> >> www.mirantis.ru
> >> >> [email protected]
> >>
> >> --
> >> Dmitry Borodaenko
> >
> > --
> > Mike Scherbakov
> > #mihgen
>
> --
> Dmitry Borodaenko

