Dear all,

Please make sure that all discussions that occur elsewhere (this ML
thread, chats, etc.) end up reflected in the LaunchPad bug (even if a
theory is discussed and then eliminated, it's useful to have it
mentioned in the bug so that other people don't repeat the same line
of investigation). I originally emailed fuel-dev@ only to draw
attention to the problem; I did not intend to split the discussion.

Thanks,

On Fri, Feb 28, 2014 at 8:35 AM, Matthew Mosesohn
<[email protected]> wrote:
> I started reaching out to our community folks, Dina and Dmitry.
>
> We tried a few variations, but got the same result: nova and cinder
> do not tolerate having the AMQP backend shifted out from under them.
>
> If we remove haproxy and connect directly to RabbitMQ on a virtual IP,
> all nova and cinder services die when we shift the virtual IP to
> another node. Neutron somehow survives and reconnects in about 25
> seconds and picks up where it left off.
>
> For the record, we're running on 2013.2.2 code. Dmitry Mescheryakov
> asked me to provide a diff of the RPC code between neutron and
> cinder, to help determine why Neutron can resume connections while
> Cinder cannot. Here is the diff:
> http://paste.openstack.org/show/uXyeYUGxMiAhmcGlK8VZ/
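[Editor's note: one plausible explanation, to be confirmed against the diff above, is that Neutron's RPC layer wraps its AMQP connection in a retry loop while Cinder lets the first failure propagate. As a hedged illustration only, not the actual oslo/kombu code, a reconnect loop with capped exponential backoff looks roughly like this:]

```python
import time

def connect_with_backoff(connect, max_retries=5, base_delay=1.0,
                         max_delay=30.0, sleep=time.sleep):
    """Call `connect` until it succeeds, backing off exponentially.

    `connect` is any zero-argument callable that raises on failure
    (e.g. establishing an AMQP connection). Returns its result, or
    re-raises the last error once `max_retries` is exhausted.
    """
    delay = base_delay
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except OSError as exc:  # connection refused, reset, timed out, ...
            last_error = exc
            if attempt == max_retries:
                break
            sleep(min(delay, max_delay))
            delay *= 2  # exponential backoff, capped at max_delay
    raise last_error
```

[A service with such a loop survives a VIP move by re-dialing the broker; a service that lets the first exception escape dies, which matches the Cinder vs. Neutron behavior described above.]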
>
> For more info:
> Errors we see in Cinder logs: http://pastie.org/private/w8iigjzijfczvsw5ddelwq
> Errors we see in Neutron logs: 
> http://pastie.org/private/uelxryhbr42jijip0loe2w
>
> A diagnostic snapshot is attached to the bug mentioned earlier in
> this thread.
>
> We're still digging for leads to fix this HA failover issue.
>
> -Matthew
>
> On Fri, Feb 28, 2014 at 1:12 PM, Vladimir Kuklin <[email protected]> wrote:
>> It will not help if you shut down the controller. The problem is that you
>> have hung AMQP sessions, which the kombu driver does not appear to handle
>> correctly.
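[Editor's note: the hung sessions Vladimir describes are often caused by HAProxy's idle timeouts silently dropping long-lived AMQP connections while the client still believes them alive. One possible proxy-level mitigation, sketched below against the config quoted later in this thread; the timeout values are illustrative guesses, not tested settings:]

  listen rabbitmq-openstack
    bind 192.168.0.2:5672
    mode tcp
    balance  roundrobin
    option clitcpka
    option srvtcpka
    timeout client 3h
    timeout server 3h

    server  node-1 192.168.0.3:5673   check inter 5000 rise 2 fall 3
    server  node-2 192.168.0.4:5673   check inter 5000 rise 2 fall 3  backup
    server  node-3 192.168.0.5:5673   check inter 5000 rise 2 fall 3  backup

[TCP keepalives let HAProxy and the kernel detect dead peers instead of leaving half-open sessions around; long explicit timeouts avoid killing idle but healthy AMQP channels.]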
>>
>>
>> On Fri, Feb 28, 2014 at 1:09 PM, Bogdan Dobrelya <[email protected]>
>> wrote:
>>>
>>> On 02/28/2014 05:44 AM, Dmitry Borodaenko wrote:
>>> > Team,
>>> >
>>> > Ryan and I have spent all day investigating
>>> > https://bugs.launchpad.net/fuel/+bug/1285449
>>> >
>>> > What we have found so far confirms that this is a critical bug that
>>> > absolutely must be resolved before 4.1 is released.  I have documented
>>> > our findings in the bug comments, someone please take over the
>>> > investigation when you come to the office tomorrow morning MSK time.
>>> >
>>> > I have a feeling that once the root cause is found, the fix will be
>>> > low-impact: either a change in the HAProxy configuration for
>>> > RabbitMQ, a patch or upgrade of HAProxy or kombu, or something similar.
>>> > But first we need to understand what exactly breaks, and why this only
>>> > affects some services and not all of them.
>>> >
>>> > Thanks,
>>> >
>>>
>>> Here is a recent RabbitMQ discussion quote from the
>>> Fuel-conductors-support team Skype chat (RU + translation):
>>>
>>> Wednesday, February 26, 2014
>>> [4:00:10 PM] Maxim Yefimov: Коллеги, вопрос есть:
>>> (I have a question)
>>>
>>> listen rabbitmq-openstack
>>>   bind 192.168.0.2:5672
>>>   balance  roundrobin
>>>
>>>   server  node-1 192.168.0.3:5673   check inter 5000 rise 2 fall 3
>>>   server  node-2 192.168.0.4:5673   check inter 5000 rise 2 fall 3  backup
>>>   server  node-3 192.168.0.5:5673   check inter 5000 rise 2 fall 3  backup
>>>
>>> [4:01:01 PM] Maxim Yefimov: Зачем одновременно roundrobin и
>>> active-passive?
>>> (Why do we use roundrobin and active-passive at once for RabbitMQ?)
>>>
>>> [4:01:39 PM] Miroslav Anashkin: Чтобы коннект не рвался
>>> (To make sure the connection wouldn't break)
>>>
>>> [4:02:01 PM] Miroslav Anashkin: У кролика кластер существует строго в
>>> виде мастер-слейв
>>> (RabbitMQ clustering is restricted to master-slave only)
>>>
>>> [4:02:23 PM] Miroslav Anashkin: Соответственно даже если какая-то нода с
>>> запросом к слейву придет - та его на мастер отправит
>>> (Hence, any query that reaches a RabbitMQ slave node is re-sent to
>>> the master)
>>>
>>> [4:02:52 PM] Miroslav Anashkin: Поэтому сделали так чтобы ХАПрокси
>>> всегда всех посылал на одну ноду
>>> (That's why HAProxy always redirects all queries to a single RabbitMQ
>>> node)
>>>
>>> Honestly, this explanation is not clear to me. Why couldn't we make
>>> OpenStack establish direct connections to arbitrarily chosen (load
>>> balanced) RabbitMQ nodes, skipping HAProxy entirely? (Despite this:
>>> "any query to a RabbitMQ slave is re-sent to the master")
>>>
>>> Could that resolve the issue? I think I will investigate this option as
>>> well.
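[Editor's note: the option Bogdan describes exists in the kombu rabbit driver shipped with 2013.2-era services: the client accepts a comma-separated broker list and fails over between nodes itself, with no proxy in the path. A hedged sketch of the relevant service config; values are illustrative, taken from the addresses quoted above:]

  [DEFAULT]
  rabbit_hosts = 192.168.0.3:5673,192.168.0.4:5673,192.168.0.5:5673
  rabbit_ha_queues = True
  rabbit_retry_interval = 1
  rabbit_retry_backoff = 2
  rabbit_max_retries = 0

[With rabbit_max_retries = 0 the driver retries forever, cycling through the listed hosts; rabbit_ha_queues makes the queues mirrored so a surviving node has the messages.]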
>>>
>>>
>>> --
>>> Best regards,
>>> Bogdan Dobrelya,
>>> Skype #bogdando_at_yahoo.com
>>> Irc #bogdando
>>>
>>> --
>>> Mailing list: https://launchpad.net/~fuel-dev
>>> Post to     : [email protected]
>>> Unsubscribe : https://launchpad.net/~fuel-dev
>>> More help   : https://help.launchpad.net/ListHelp
>>
>>
>>
>>
>> --
>> Yours Faithfully,
>> Vladimir Kuklin,
>> Senior Deployment Engineer,
>> Mirantis, Inc.
>> +7 (495) 640-49-04
>> +7 (926) 702-39-68
>> Skype kuklinvv
>> 45bk3, Vorontsovskaya Str.
>> Moscow, Russia,
>> www.mirantis.com
>> www.mirantis.ru
>> [email protected]
>>
>>
>



-- 
Dmitry Borodaenko
