As a comment on Chris's answer: Instagram uses RabbitMQ HA across availability zones: https://twitter.com/rbranson/status/310461932618534913
More details about their setup here: http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html

On Thu, Aug 22, 2013 at 10:20 PM, Chris wrote:

> Hi Jon,
>
> I'm not 100% familiar with Amazon's availability zones and how they work,
> but... it sounds to me like they are in different locations and different
> networks? If so, clustering is probably not a good idea in this case.
> See: http://www.rabbitmq.com/partitions.html
>
> I don't know if this is the cause of the issues you've seen, but it may
> be the cause of issues in the future... On the other hand, if I am wrong
> about availability zones, then you can safely disregard this message! ;-)
>
> -Chris
>
> On Thu, Aug 22, 2013 at 3:17 PM, Jon Dokulil wrote:
>
>> We've seen this happen twice now and each time it's been a pain to work
>> around (we ended up creating a whole new cluster each time). Here's the
>> scenario we have seen:
>>
>> Our setup:
>>
>> 1. Three RabbitMQ 3.1.5 nodes running on the Amazon Linux AMI. Each
>>    node is in a different availability zone in the US-EAST region on AWS.
>>    We'll call them nodes A, B, and C
>> 2. Each queue is using an HA policy
>> 3. All queues are durable
>> 4. We Basic.Publish with DeliveryMode=2
>> 5. All clients are initially connected to node A
>>
>> The scenario:
>>
>> 1. Node A is shut down (the last time I did it via 'sudo
>>    /etc/init.d/rabbitmq-server stop')
>> 2. All connected clients see the shutdown and successfully transition
>>    to using one of the other nodes. About half connect to node B and the
>>    other half connect to node C
>> 3. We notice that a few of the queues still show their "node" as
>>    being node A, even though it is not currently running.
>> 4. Node A is brought back online. The RabbitMQ management console
>>    (webapp) shows everything is fine on the homepage.
>> 5. When A comes back online, those queues that show A as their 'node'
>>    now show zero mirrors.
>> 6. I attempt to delete the queue via the management webapp. At that
>>    point all three nodes become 100% unresponsive. The management webapp
>>    fails to respond and all communication in our application stops. CPU
>>    fluctuates between 10-40% but memory doesn't seem to be leaking. It's
>>    difficult to know what is happening because rabbitmqctl is also
>>    unresponsive. Attempts to gracefully stop the nodes all hang.
>>
>> Does anybody have experience with this? What additional information
>> should I provide? It's causing a lot of stress and confuses the heck out of
>> me. Any guidance is much appreciated.
>>
>> _______________________________________________
>> rabbitmq-discuss mailing list
>> https://lists.rabbitmq.com/cgi-bin/mailman/listinfo/rabbitmq-discuss
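
For anyone trying to reproduce the setup Jon describes (an HA policy on every queue, durable queues, DeliveryMode=2), here is a rough sketch of those three settings in Python. The thread does not say which policy name, pattern, or ha-mode was actually used, so `ha-all`, `^`, and `all` below are assumptions, and `ha_policy_command` is a hypothetical helper that only builds the `rabbitmqctl set_policy` argument list without running it.

```python
import json


def ha_policy_command(name="ha-all", pattern="^", mode="all"):
    """Build the argv for `rabbitmqctl set_policy NAME PATTERN DEFINITION`.

    The name, pattern, and mode defaults are assumptions; the original
    message only says that each queue uses an HA policy.
    """
    definition = json.dumps({"ha-mode": mode})
    return ["rabbitmqctl", "set_policy", name, pattern, definition]


# Item 3 in the setup list: queues are declared durable.
QUEUE_DECLARE_OPTIONS = {"durable": True}

# Item 4 in the setup list: messages are published as persistent.
PUBLISH_PROPERTIES = {"delivery_mode": 2}


if __name__ == "__main__":
    # Print the command rather than executing it against a broker.
    print(ha_policy_command())
```

The returned list could be handed to `subprocess.run` on a broker host. How the two dictionaries are passed to a client depends on the AMQP library in use, so treat them as a checklist rather than a working client.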
