Re: An ill borker brings down the whole cluster

Alan Conway Wed, 04 Nov 2009 06:09:39 -0800

On 11/03/2009 04:41 PM, Shan Wang wrote:

Client side we are still using 0.4, I'm not sure about the exact version, 
should be last version before 0.5.
Cluster side we are using 0.5.752581-26.el5.


Unfortunately I haven't got the environment to build qpid myself so I can't use 
latest trunk.


I'd like to try to reproduce your issue, but I need some more details:

On 11/03/2009 06:13 AM, Shan Wang wrote:

Hi All,

We have two qpid 0.5 brokers running in cluster mode on two different
boxes. The cluster works fine in normal cases, ie, if broker1 is
shutdown cleanly, broker2 will keep on serving clients. But today we
found one broker suddenly lost response to all connected clients and
admin tools. All producer and consumer clients are still connected
but failed to consume any messages from the queue.

Just to clarify: did only one broker become unresponsive or did both of them become unresponsive?


The command line admin tool failed with a time out error. The only
error message we found is in the log of broker 1, which said this:

2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel
error 157487219 on 172.27.34.201:9908-389(local): transport-busy:
Channel 1 already attached to [email protected]
64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150)
(unresolved: 172.27.34.201:9908 172.27.34.202:13287 )

Do you still have the full logs of both brokers at the time they were unresponsive? Can you run the broker with


 --log-enable=notice+ --log-enable=debug+:cluster

for future runs so we can hopefully get a bit more information about what the cluster is doing at the time of the hang?
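
With debug logging enabled the logs get large, so it may help to pull out just the error-level lines first. The following is only a rough sketch in Python, not part of qpid: the function name filter_log is made up for illustration, and the line layout ("<date> <time> <level> <message>", one entry per line) is an assumption based solely on the single log line quoted above.

```python
import re

# Assumed layout, inferred only from the one log line quoted in this thread:
#   "2009-oct-31 10:17:49 error <message...>"
LINE_RE = re.compile(r"^(\d{4}-\w{3}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+) (.*)$")


def filter_log(lines, levels=("error", "critical")):
    """Return (date, time, level, message) tuples for lines at the given levels.

    Lines that do not match the assumed layout are skipped.
    """
    hits = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match and match.group(3) in levels:
            hits.append(match.groups())
    return hits
```

Usage would be something like filter_log(open(path)) for each broker's log file, where path is wherever your broker writes its log; comparing the timestamps from both brokers should show what each one was doing around the hang.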

What are your clients doing? Can you reproduce the problem using the sender and receiver examples?


How many clients are running against each broker?

How easy is it to reproduce the problem?


After restarting only broker 1, everything started to work again. So
surprisingly it seems that when one of the brokers in the cluster suffered
a problem, the whole cluster just stalled, at least from the
consumer's point of view (I can't be sure whether the producer was
working during the down time; once things were back to normal, the
consumer did receive messages sent some time earlier). The consumer program uses
FailoverManager and AsyncSession, basically not far from the failover
example in the qpid developing doc. So can anyone please tell me what
the above error message means and have we seen similar problems to
the cluster before?

Yes, I've seen similar problems before, but believe them all to be fixed at this point on trunk. It might be the issue fixed by


http://svn.apache.org/viewvc?view=revision&revision=799687

If I can reproduce the problem then I can verify if it is fixed on trunk.

Cheers,
Alan.

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]
