On 07/28/2014 09:20 AM, Bogdan Dobrelya wrote:
Hello.
I'd like to bring your attention to a major RPC failover issue in
impl_rabbit.py [0]. There are several *related* patches and a number of
concerns should be considered as well:
- Passive exchanges fix [1] (looks like the problem is much deeper than
it seems though).
- The first version of the fix [2], which makes the producer declare a
queue and bind it to the exchange, as the consumer does.
- Making all RPC-involved reply_* queues durable in order to preserve
them in RabbitMQ after failover (there could be a TTL for such queues
as well)
- RPC throughput tuning patch [3]

I believe the issue [0] should be at least prioritized and assigned to
some milestone.

I think the real issue is the lack of clarity around what guarantees are made by the API.

Is it the case that an RPC call should never fail (i.e. never time out) due to failover? Either way, the answer to this should be very clear.

If failures may occur, then the calling code needs to handle that. If eliminating failures is part of the 'contract' then the library should have a clear strategy for ensuring (and testing) this.

Another possible scenario is that the connection is lost immediately after writing the request message to the socket (but before it is processed by the rabbit broker). In this case the issue is that the request is not confirmed, so the send can complete before the message is 'safe'. In other words, requests are unreliable.

My own view is that if you want to avoid timeouts on failover, the best approach is to have oslo.messaging retry the entire request regardless of the point it had reached in the previous attempt. That is, rather than trying to make delivery of responses reliable, assume that both requests and responses are unreliable and re-issue the request immediately on failover. (The retry logic could even be made independent of any driver if desired.)

This is perhaps a bigger change, but I think it is easier to get right and will also be more scalable and performant, since it doesn't require replication of every queue and every message.
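
To make the retry idea concrete, here is a minimal sketch in Python of retry logic that knows nothing about the driver. It is not oslo.messaging code: RPCTimeout, call_with_retry and make_flaky_transport are hypothetical names invented for illustration, and send_request stands in for whatever a driver does to issue one call and wait for its reply.

```python
import time


class RPCTimeout(Exception):
    """Hypothetical stand-in for the timeout error a driver would raise."""


def call_with_retry(send_request, request, attempts=3, delay=0.0):
    """Re-issue the whole request until a reply arrives or attempts run out.

    send_request is any callable that sends one request and returns the
    reply, raising RPCTimeout if no reply is received. Nothing here depends
    on a particular driver or broker.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            # Start from scratch each time, regardless of how far the
            # previous attempt got before the failure.
            return send_request(request)
        except RPCTimeout as exc:
            last_error = exc
            if delay and attempt + 1 < attempts:
                time.sleep(delay)
    raise last_error


def make_flaky_transport(failures):
    """Build a fake transport that times out `failures` times, then replies."""
    state = {"calls": 0}

    def send(request):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RPCTimeout("connection lost during failover")
        return {"echo": request}

    return send, state


if __name__ == "__main__":
    send, state = make_flaky_transport(failures=2)
    reply = call_with_retry(send, "ping")
    print(reply, "after", state["calls"], "calls")
```

Note that re-issuing the whole request means the server may process it more than once, so the sketch assumes that is acceptable for the call in question.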



[0] https://bugs.launchpad.net/oslo.messaging/+bug/1338732
[1] https://review.openstack.org/#/c/109373/
[2] https://github.com/noelbk/oslo.messaging/commit/960fc26ff050ca3073ad90eccbef1ca95712e82e
[3] https://review.openstack.org/#/c/109143/



_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev