On 06/13/2013 02:37 PM, Kerry Bonin wrote:
I'm the system architect for a large program that has just shipped a major product, and Qpid is one of the foundations of our infrastructure. I thought I'd share a few things, and ask a few questions about my next step...
Thanks for taking the time Kerry, it's always great to get feedback!
We shipped using Qpid 0.14 and are updating now. Our product is (currently) 100% Windows, although most subsystems are implemented in cross-platform C++ or Python. We are essentially a video surveillance and access control system, so we have high volumes of events: low millions per day, in bursts up into the low hundreds per second. In addition to event transport, our infrastructure is an ESB-model SOA built over Qpid. We have high reliability requirements: no single points of failure, fast failover and recovery, active-active services, load balancing, and encryption throughout. Our biggest challenges came from the large reliability gap between the Windows and *nix implementations. We've contributed all of our fixes back, but under high load we had issues that burnt a few man-months of senior developer time. It's now running reasonably well for us. Broker failover was a challenge. We ended up building a wrapper over the Qpid client library to abstract all the connection, sender, receiver, etc. objects, so applications didn't have to deal with tearing down and building up these objects when a broker died. (Which happened often under high load in early testing, and still happens often in our worst-case testbeds where we take brokers down or break connections to test reliability.) Since federation didn't work on Windows, system splits were unacceptable, so we also had to implement failover recovery. This also required distributing and maintaining an ordered list of brokers. Again, this was a pain, but is now working. It would be nice if the client handled these things for us.
The client does have the ability to reconnect and re-establish the sessions, senders and receivers automatically. A list of brokers can also be provided and updated. There is additionally a utility that subscribes to the 'failover exchange' to get updates. The latter mechanism could be modified or copied for any sort of messaging-based distribution of updates.
Can you give a bit more detail on what is missing or doesn't work as required?
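For reference, here is a rough sketch of the reconnect options mentioned above, using the Python client. The option names (reconnect, reconnect_urls, reconnect_interval_min, reconnect_interval_max, reconnect_limit) are the connection options of the qpid.messaging API as I understand them; the broker addresses and the helper name build_reconnect_options are made up for illustration, so check the names against the version you are running.

```python
def build_reconnect_options(broker_urls, interval_min=1, interval_max=30, limit=None):
    """Return (primary_url, options) for a client that fails over across broker_urls.

    broker_urls is an ordered list: the first entry is tried first.
    """
    if not broker_urls:
        raise ValueError("at least one broker URL is required")
    options = {
        "reconnect": True,                       # re-establish the connection automatically
        "reconnect_urls": list(broker_urls),     # ordered list of brokers to try
        "reconnect_interval_min": interval_min,  # seconds, first retry delay
        "reconnect_interval_max": interval_max,  # seconds, upper bound on the delay
    }
    if limit is not None:
        options["reconnect_limit"] = limit       # give up after this many attempts
    return broker_urls[0], options


# Hypothetical broker addresses, for illustration only.
primary, opts = build_reconnect_options(["broker-a:5672", "broker-b:5672"])

# With the Python client installed, these would be used roughly like this:
#   from qpid.messaging import Connection
#   conn = Connection(primary, **opts)
#   conn.open()
```

The C++ client takes the same options as a string map on the Connection constructor, and qpid::messaging::FailoverUpdates is the utility that listens on the failover exchange.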
Another serious challenge – can someone (if they haven't already) clean up the error propagation from the client to the application? Someone who was trying to recreate our broker failure detection asked me how we did it reliably – I had to give them a LIST of exceptions and error returns that we had to discover by trial and error to do this 100% of the time. This should be simple, and it isn't...
Can you give a bit more detail on this? I know at one point there were some cases where a ConnectionException would get thrown when the first attempt to connect failed (rather than a TransportFailure as expected), but I believe that was fixed already.
For our next release, we need to dramatically increase the volume of messages we handle. In lieu of federation on Windows (when will that work?), I'm facing having to add code to manage pools of brokers and dynamically load balance queues across brokers: once a broker is approaching 80% load, or dies, I need to move queues to other brokers and coordinate these moves via my wrapper. I can do this, but I'd rather not. Does anyone have any other suggestions on how to handle 10M messages a second on (many) Windows boxes, spread across lots of queues at dynamic and unpredictably varying loads, that works with a fast failover and failover recovery mechanism?
Kerry
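To make the rebalancing idea above concrete, here is a minimal sketch of the planning step only, assuming each broker reports whether it is alive and a per-queue load as an integer percentage of its capacity. The data layout, the function name plan_queue_moves, and the "heaviest queue first, least-loaded target" policy are all assumptions for illustration; this is not something Qpid provides, and actually moving a queue would still have to be coordinated through the wrapper.

```python
def plan_queue_moves(brokers, threshold=80):
    """Plan queue moves away from dead or overloaded brokers.

    brokers: dict of name -> {"alive": bool, "queues": {queue_name: load_percent}}
    Returns a list of (queue, source_broker, target_broker) tuples.
    """
    # Current load of every broker that can still accept queues.
    load = {name: sum(b["queues"].values())
            for name, b in brokers.items() if b["alive"]}
    moves = []
    for source, b in brokers.items():
        # Consider the heaviest queues first so fewer moves are needed.
        for queue, q_load in sorted(b["queues"].items(), key=lambda kv: -kv[1]):
            if b["alive"] and load[source] < threshold:
                break  # this broker is back under the threshold
            candidates = [n for n in load if n != source]
            if not candidates:
                break  # no other live broker to move to
            target = min(candidates, key=lambda n: load[n])
            if b["alive"] and load[target] + q_load >= threshold:
                continue  # the move would only overload the target
            # Queues on a dead broker are always moved, even onto a busy target.
            moves.append((queue, source, target))
            load[target] += q_load
            if b["alive"]:
                load[source] -= q_load
    return moves


# Hypothetical snapshot: "a" is over 80%, "c" has died.
snapshot = {
    "a": {"alive": True, "queues": {"q1": 50, "q2": 40}},
    "b": {"alive": True, "queues": {"q3": 20}},
    "c": {"alive": False, "queues": {"q4": 10}},
}
print(plan_queue_moves(snapshot))
```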
