Re: [rabbitmq-discuss] A three-node cluster hangs completely in ec2

Alvaro Videla Thu, 22 Aug 2013 14:10:25 -0700

As a comment to Chris answer: Instagram uses RabbitMQ HA across
availability zones: https://twitter.com/rbranson/status/310461932618534913


More details about their setup here:
http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html


On Thu, Aug 22, 2013 at 10:20 PM, Chris <[email protected]> wrote:

> Hi Jon,
>
> I'm not 100% familiar with Amazon's availability zones and how they work,
> but... it sounds to me like they are in different locations and different
> networks?  If so, clustering is probably not a good idea in this case.
>  See: http://www.rabbitmq.com/partitions.html
>
> I don't know if this is the cause for the issues you've seen, but it may
> be the cause of issues in the future...  On the other hand, if I am wrong
> about availabity zones, then you can safely disregard this message! ;-)
>
> -Chris
>
>
>
> On Thu, Aug 22, 2013 at 3:17 PM, Jon Dokulil <[email protected]> wrote:
>
>> We've seen this happen twice now and each time it's been a pain to work
>> around (we ended up creating a whole new cluster each time). Here's the
>> scenario we have seen:
>>
>> Our setup:
>>
>>    1. Three RabbitMQ 3.1.5 nodes running on the Amazon Linux AMI. Each
>>    node is in a different availability zone in the US-EAST region on AWS.
>>    We'll call them nodes A, B, and C
>>    2. Each queue is using an HA policy
>>    3. All queues are durable
>>    4. We Basic.Publish with DeliveryMode=2
>>    5. All clients are initially connected to node A
>>
>> The scenario:
>>
>>    1. Node A is shutdown (the last time I did it via 'sudo
>>    /etc/init.d/rabbitmq-server stop
>>    2. All connected clients see the shutdown and successfully transition
>>    to using one of the other nodes. About half connect to node B and the 
>> other
>>    half connect to node C
>>    3. We notice that a few of the queues still show their "node" as
>>    being node A, even though it is not currently running.
>>    4. Node A is brought back online. The RabbitMQ management console
>>    (webapp) shows everything is fine on the homepage.
>>    5. When A comes back online, those queues that show A as their 'node'
>>    now show zero mirrors.
>>    6. I attempt to delete the queue via the management webapp. At that
>>    point all three nodes become 100% unresponsive. The management webapp 
>> fails
>>    to respond and all communication in our application stops. CPU fluctuates
>>    between 10-40% on but memory doesn't seem to be leaking. It's difficult to
>>    know what is happening because rabbitmqctl is also unresponsive. Attempts
>>    to gracefully stop the nodes all hang.
>>
>> Does anybody have experience with this? What additional information
>> should I provide? It's causing a lot of stress and confuses the heck out of
>> me. Any guidance is much appreciated.
>>
>>
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> [email protected]
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>>
>>
>
> _______________________________________________
> rabbitmq-discuss mailing list
> [email protected]
> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"rabbitmq-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/rabbitmq-discuss.
For more options, visit https://groups.google.com/groups/opt_out.

Re: [rabbitmq-discuss] A three-node cluster hangs completely in ec2

Reply via email to