Re: Qpid post-mortem and request for suggestions for (my) next release challenge (10M msgs/sec on Windows)

Gordon Sim Tue, 18 Jun 2013 03:11:18 -0700

On 06/17/2013 07:22 PM, Kerry Bonin wrote:

On Mon, Jun 17, 2013 at 7:26 AM, Gordon Sim <[email protected]> wrote:

On 06/14/2013 03:58 PM, Kerry Bonin wrote:

  - to prevent network splits, how are recovered brokers monitored?  When a
failed broker recovers, do clients switch back?  How often / aggressively
checked?


No, there is no switch back behaviour in the client. The new HA code
allows a broker to be classed as in a backup or primary role and backups
will reject or kick off any clients causing them to failover. Whatever
cluster management solution was in use would then detect changes to primary
and use QMF to tell each broker what their role was.



I'd like to suggest that this is a serious deficiency.  It would be nice if
it was possible to have some HA features without having to deploy
clustering.

Just to be clear, what I'm referring to above does not involve the old clustering solution, tightly bound to corosync.

The new HA has no external dependencies. It does however leave the task of managing the cluster to some external system (rgmanager, pacemaker etc), which would be responsible for deciding who is the primary, detecting failure, electing a new primary, handling restart and failback etc.

The broker simply provides the hooks for the cluster management solution to notify each broker in the cluster of their role.
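
The role-based behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Qpid code: `RoleBroker`, `set_role` and `connect_to_primary` are hypothetical names, and `set_role` merely stands in for whatever the external cluster manager would do through the broker's hooks.

```python
class RoleBroker:
    """Toy broker that only accepts clients while it holds the primary role."""

    def __init__(self, name, role="backup"):
        self.name = name
        self.role = role
        self.clients = set()

    def set_role(self, role):
        # Stand-in for the cluster manager telling the broker its role.
        self.role = role
        if role != "primary":
            # A backup kicks off its clients, which forces them to fail over.
            self.clients.clear()

    def accept(self, client):
        if self.role != "primary":
            raise ConnectionRefusedError(f"{self.name} is a backup")
        self.clients.add(client)


def connect_to_primary(client, brokers):
    """Try each broker in turn; only the current primary accepts."""
    for broker in brokers:
        try:
            broker.accept(client)
            return broker
        except ConnectionRefusedError:
            continue
    raise ConnectionError("no primary available")


a, b = RoleBroker("A", "primary"), RoleBroker("B")
first = connect_to_primary("c1", [a, b])       # lands on A

# Hypothetical cluster manager detects that A failed and promotes B.
a.set_role("backup")
b.set_role("primary")
second = connect_to_primary("c1", [a, b])      # fails over to B

# A is back, but only as a backup, so a new client also ends up on B.
third = connect_to_primary("c2", [a, b])
```

Because a recovered broker rejects clients until it is told it is primary, every client converges on the same broker without any switch-back logic in the client.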

It does rely on federation, but once that issue is resolved it should work on Windows as well.

Though I haven't actually tried, I suspect it may be possible to simply use the QMF exposed hooks without needing any replication. (Certainly I would expect it would not require a great deal of modification to get that to work).

While the lack of clustering for Windows makes this an obvious
problem for Windows users, I'd certainly argue that *nix users might also
like to have failover and recovery without clustering.  And without
clustering, failover without recovery is kind of useless as a HA feature
due to the split use case.  (i.e. 2 clients talking through broker A,
broker A fails and 2 clients failover to broker B.  Broker A comes back
online.  Another client joins, connects to broker A.  We now have a split,
new client cannot see old clients.)
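
The split scenario in the parentheses above can be reproduced with a small simulation. It is a minimal sketch that assumes clients always try a fixed broker list in order and never switch back; `PlainBroker`, `connect_first_up` and `visible_peers` are hypothetical names, not part of any Qpid API.

```python
class PlainBroker:
    """Toy broker with no notion of primary or backup roles."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.clients = set()

    def fail(self):
        self.up = False
        self.clients.clear()


def connect_first_up(client, brokers):
    """Connect to the first reachable broker in the client's list."""
    for broker in brokers:
        if broker.up:
            broker.clients.add(client)
            return broker
    raise ConnectionError("no broker reachable")


def visible_peers(client, broker):
    """A client can only exchange messages with peers on the same broker."""
    return broker.clients - {client}


a, b = PlainBroker("A"), PlainBroker("B")
urls = [a, b]

where = {c: connect_first_up(c, urls) for c in ("c1", "c2")}  # both on A
a.fail()
where = {c: connect_first_up(c, urls) for c in ("c1", "c2")}  # both fail over to B
a.up = True                                   # A recovers; nobody switches back
where["c3"] = connect_first_up("c3", urls)    # the new client lands on A

# True when the new client cannot see any of the old clients.
split = visible_peers("c3", where["c3"]) == set()
```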

  - how is the application notified on broker failure, connection failover,
recovery?


It isn't. Any threads using the connection will essentially block until
either the connection is re-established or the configured limit is
reached and the client gives up trying.
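
The blocking behaviour described above amounts to a bounded retry loop. The following is a generic sketch of that pattern rather than the actual client implementation; the function name `reconnect` and the parameters `limit` and `interval` are assumptions made for illustration.

```python
import time


def reconnect(try_connect, limit, interval=0.0):
    """Block until try_connect() succeeds or `limit` attempts have failed."""
    last_error = None
    for _ in range(limit):
        try:
            return try_connect()
        except ConnectionError as exc:
            last_error = exc
            time.sleep(interval)  # the calling thread is blocked meanwhile
    raise ConnectionError(f"gave up after {limit} attempts") from last_error


attempts = []


def flaky():
    """Hypothetical connect function that succeeds on the third attempt."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("broker down")
    return "connected"


result = reconnect(flaky, limit=5)
```

Note that the caller learns nothing until the loop either returns or raises, which is the gap discussed in the rest of this thread.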

Now I write this I do recall a conversation on this topic with you some
time back, with this being an issue for you.



I'd like to suggest that this remains a serious deficiency.  In most
software, if a critical failure occurs down in middleware or its supporting
infrastructure, it would be nice if the middleware library could report
this to the application, so a system administrator could do something about
it.  While it's certainly possible to rely on external monitoring systems to
notify an admin, it's also good practice to have an application display
some sort of error condition.  A broker failure in an ESB SOA application
is a critical failure, and the application needs to inform its user that it
has lost connectivity to the system.

I agree and I would like to fix that deficiency. I'm going to be working on reconnect/replay again in conjunction with AMQP 1.0 and will see if I can come up with a solution then. I have created a JIRA: https://issues.apache.org/jira/browse/QPID-4932
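
One possible shape for the kind of notification requested above is a callback invoked on each connection state change. This is purely an illustrative sketch and not the design tracked in the JIRA; `reconnect_with_events`, the `notify` callback and the event names are all hypothetical.

```python
def reconnect_with_events(try_connect, limit, notify):
    """Retry try_connect(), reporting state changes via notify(event, detail)."""
    notify("disconnected", None)
    for attempt in range(1, limit + 1):
        try:
            conn = try_connect()
        except ConnectionError as exc:
            notify("retry_failed", f"attempt {attempt}: {exc}")
            continue
        notify("reconnected", f"after {attempt} attempt(s)")
        return conn
    notify("gave_up", f"{limit} attempts failed")
    raise ConnectionError("reconnect limit reached")


events = []


def always_down():
    """Hypothetical connect function for a broker that never comes back."""
    raise ConnectionError("broker unreachable")


try:
    reconnect_with_events(always_down, limit=2,
                          notify=lambda event, detail: events.append(event))
except ConnectionError:
    pass  # the application has already been told via the "gave_up" event
```

An application could use such a callback to display an error condition to its user instead of silently blocking.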


