On 06/13/2013 02:37 PM, Kerry Bonin wrote:
I'm the system architect for a large program that has just shipped a major
product, and Qpid is one of the foundations of our infrastructure. I
thought I'd share a few things, and ask a few questions about my next
step...

Thanks for taking the time Kerry, it's always great to get feedback!

  We shipped using Qpid 0.14 and are updating now. Our product is (currently)
100% Windows, although most subsystems are implemented in cross-platform
C++ or Python. We are essentially a video surveillance and access control
system, so we have high volumes of events, low millions per day in bursts
up into the low hundreds per second. In addition to event transport, our
infrastructure is an ESB-model SOA built over Qpid. We have high reliability
requirements – no single points of failure, fast failover and recovery,
active-active services, load balancing, and encryption throughout all.

  Our biggest challenges came from the large reliability gap between the
Windows and *nix implementations. We've contributed all of our fixes back,
but under high load we had issues that burnt a few man months of senior
developer time. It's now running reasonably well for us.

  Broker failover was a challenge. We ended up building a wrapper over the
Qpid client library to abstract all the connection, sender, receiver, etc.
objects, so applications didn't have to deal with tearing down and building
up these objects when a broker died. (Which happened often under high load
in early testing, and still happens often in our worst case testbeds where
we take them down or break connections to test reliability.) Since
federation didn't work on Windows, system splits were unacceptable, so we
also had to implement failover recovery.  This also required distributing
and maintaining an ordered list of brokers. Again, this was a pain, but is
now working. It would be nice if the client handled these things for us.

The client does have the ability to reconnect and re-establish the sessions, senders and receivers automatically. A list of brokers can also be provided and updated. There is additionally a utility that subscribes to the 'failover exchange' to get updates. The latter mechanism could be modified or copied for any sort of messaging-based distribution of updates.
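
As a rough illustration of the reconnect options mentioned above, here is a minimal sketch assuming the Python qpid.messaging client. The option names (reconnect, reconnect_urls, reconnect_interval) are given as I recall them and should be checked against the client documentation for the version in use. The broker addresses and the helper names reconnect_options and connect are hypothetical.

```python
# Sketch only: builds keyword options for the Python qpid.messaging client.
# Broker addresses passed in are placeholders chosen by the caller.

def reconnect_options(brokers, interval_seconds=3):
    """Return Connection keyword options that enable automatic reconnect."""
    if not brokers:
        raise ValueError("at least one broker address is required")
    return {
        "reconnect": True,                       # reconnect after a failure
        "reconnect_urls": list(brokers),         # brokers to try when reconnecting
        "reconnect_interval": interval_seconds,  # seconds between attempts
    }


def connect(brokers):
    """Open a connection to the first broker, with the others as fallbacks."""
    # Imported here so the helper above can be used without the client installed.
    from qpid.messaging import Connection
    return Connection.establish(brokers[0], **reconnect_options(brokers))
```

With options like these the application keeps using the same connection, session, sender and receiver objects across a broker failure, instead of tearing them down and rebuilding them in a wrapper.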

Can you give a bit more detail on what is missing or doesn't work as required?

  Another serious challenge – can someone (if they haven't already) clean up
the error propagation from the client to the application? Someone who was
trying to recreate our broker failure detection asked me how we did it
reliably – I had to give them a LIST of exceptions and error returns that
we had to discover by trial and error to do this 100% of the time. This
should be simple, and it isn't...

Can you give a bit more detail on this? I know at one point there were some cases where a ConnectionException would get thrown when the first attempt to connect failed (rather than a TransportFailure as expected), but I believe that was fixed already.

  For our next release, we need to dramatically increase the volume of
messages we handle. In lieu of federation on Windows (when will that
work?), I'm facing having to add code to manage pools of brokers, and
dynamically load balance queues across brokers - once a broker is
approaching 80% load, or dies, I need to move queues to other brokers, and
coordinate these moves via my wrapper.  I can do this, but I'd rather not.
Does anyone have any other suggestions on how to handle 10M messages a
second on (many) Windows boxes, spread across lots of queues at dynamic and
unpredictably varying loads, that works with a fast failover and failover
recovery mechanism?
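
To make the queue-moving idea concrete, here is a minimal sketch of the planning step only, not a recommendation. It assumes each broker's load and each queue's share of it are already known as percentages, and that a dead broker is reported with a load of None. The function name plan_moves and all of its parameters are hypothetical, and actually moving a queue and coordinating that with clients is not covered.

```python
def plan_moves(broker_load, queue_load, queue_home, threshold=80):
    """Plan (queue, source, target) moves off dead or overloaded brokers.

    broker_load: broker -> load in percent, or None if the broker is dead.
    queue_load:  queue -> load in percent that the queue contributes.
    queue_home:  queue -> broker currently hosting the queue.
    """
    # Track the projected load of live brokers as moves are planned.
    load = {b: l for b, l in broker_load.items() if l is not None}
    moves = []
    # Consider the heaviest queues first so one move relieves the most load.
    for queue in sorted(queue_home, key=lambda q: -queue_load[q]):
        source = queue_home[queue]
        dead = source not in load
        if not dead and load[source] < threshold:
            continue  # source is alive and below the threshold
        # Only accept targets that stay below the threshold after the move.
        targets = [b for b in load
                   if b != source and load[b] + queue_load[queue] < threshold]
        if not targets:
            continue  # nowhere to put this queue without overloading a broker
        target = min(targets, key=lambda b: load[b])
        load[target] += queue_load[queue]
        if not dead:
            load[source] -= queue_load[queue]
        moves.append((queue, source, target))
    return moves
```

This is a greedy, least-loaded-first choice. It keeps the plan simple but does not guarantee an optimal spread, and a queue stays where it is when no broker has room for it.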


Kerry


