On 11/04/2009 10:36 AM, Shan Wang wrote:
Hi Alan,
The whole cluster stopped responding, but qpid-tool was still able to connect to
broker2, not broker1. Based on that I suppose it was broker1 that became ill, and
restarting broker1 cured the whole cluster.
The full log of broker1 from 31 Oct is attached. We have now turned the log level
up to info+ and will apply --log-enable=debug+:cluster later.
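For completeness, the broker start line will then look roughly like this (the
cluster name and log path here are placeholders, not our real configuration):

    qpidd --cluster-name our-cluster \
          --log-enable=info+ --log-enable=debug+:cluster \
          --log-to-file=/var/log/qpidd.log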
Before the hang there were many clients sending messages to the cluster. I don't
know the exact number of clients, but it is usually between 150 and 200, and the
update rate was about 5-10 MB/minute. The receiver was receiving messages fine but
suddenly stopped working. I believe the receiver stopped working before the sender,
because after things went back to normal we could see very old messages in the
receiver's log, but not the relatively recent messages committed after the problem
started.
The affected system carries pretty serious tasks, so I can't experiment with it as
I wish, nor have I tried the sender/receiver examples. But as my latest email said,
the problem recurred this morning, this time with broker2.
The link you gave could be a similar issue, but the question is: what caused the
errors in the cluster?
Sorry for taking so long to get back to you.
I think you're seeing a combination of 2 issues:
https://bugzilla.redhat.com/show_bug.cgi?id=529489 could cause the "already
attached" error if you have a lot of sessions.
https://bugzilla.redhat.com/show_bug.cgi?id=514487 could cause the cluster to
hang if you get "already attached" errors simultaneously on 2 different cluster
members.
Both of these are fixed for the next release.
-----Original Message-----
From: Alan Conway [mailto:[email protected]]
Sent: 04 November 2009 14:10
To: [email protected]
Cc: [email protected]; [email protected]
Subject: Re: An ill broker brings down the whole cluster
On 11/03/2009 04:41 PM, Shan Wang wrote:
On the client side we are still using 0.4; I'm not sure of the exact version, but
it should be the last one before 0.5.
On the cluster side we are using 0.5.752581-26.el5.
Unfortunately I haven't got the environment to build Qpid myself, so I can't use
the latest trunk.
I'd like to try and reproduce your issue; I need some more details:
On 11/03/2009 06:13 AM, Shan Wang wrote:
Hi All,
We have two qpid 0.5 brokers running in cluster mode on two different
boxes. The cluster works fine in normal cases, i.e. if broker1 is shut
down cleanly, broker2 keeps on serving clients. But today we found that
one broker suddenly stopped responding to all connected clients and
admin tools. All producer and consumer clients were still connected
but failed to consume any messages from the queue.
Just to clarify: did only one broker become unresponsive, or did both?
The command-line admin tool failed with a timeout error. The only error
message we found is in the log of broker1, which said this:
2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel error 157487219 on 172.27.34.201:9908-389(local): transport-busy: Channel 1 already attached to [email protected] (qpid/amqp_0_10/SessionHandler.cpp:150) (unresolved: 172.27.34.201:9908 172.27.34.202:13287)
Do you still have the full logs of both brokers at the time they were
unresponsive? Can you run the broker with
--log-enable=notify+ --log-enable=debug+:cluster
for future runs so we can hopefully get a bit more information about what the
cluster is doing at the time of the hang?
What are your clients doing? Can you reproduce the problem using the sender and
receiver examples?
How many clients are running against each broker?
How easy is it to reproduce the problem?
After restarting only broker1, everything started to work again. So,
surprisingly, it seems that when one of the brokers in the cluster suffered
a problem, the whole cluster just stalled, at least from the consumer's
point of view (I can't be sure whether the producer was working during the
downtime; after things went back to normal, the consumer did receive
messages sent some time earlier). The consumer program uses FailoverManager
and AsyncSession, basically not far from the failover example in the Qpid
development docs; a rough sketch follows below. So can anyone please tell
me what the above error message means, and has anyone seen similar problems
with a cluster before?
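For reference, the consumer is roughly along these lines (a minimal sketch in
the spirit of the resuming_receiver failover example; the queue name "my-queue"
and the default connection settings are placeholders, not our real setup):

#include <qpid/client/FailoverManager.h>
#include <qpid/client/AsyncSession.h>
#include <qpid/client/SubscriptionManager.h>
#include <qpid/client/MessageListener.h>
#include <qpid/client/Message.h>
#include <iostream>

using namespace qpid::client;

// FailoverManager re-runs this command each time it (re)connects to a broker.
class Listener : public MessageListener, public FailoverManager::Command {
  public:
    void execute(AsyncSession& session, bool /*isRetry*/) {
        // Re-subscribe after every (re)connect; "my-queue" is a placeholder.
        SubscriptionManager subs(session);
        subs.subscribe(*this, "my-queue");
        subs.run(); // blocks, dispatching messages to received()
    }
    void received(Message& message) {
        std::cout << "Received: " << message.getData() << std::endl;
    }
};

int main() {
    ConnectionSettings settings; // defaults: localhost:5672
    Listener listener;
    FailoverManager connection(settings);
    try {
        connection.execute(listener); // re-runs execute() after each failover
        connection.close();
    } catch (const std::exception& e) {
        std::cerr << e.what() << std::endl;
        return 1;
    }
    return 0;
}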
Yes, I've seen similar problems before, but I believe them all to be fixed at
this point on trunk. It might be the issue fixed by
http://svn.apache.org/viewvc?view=revision&revision=799687
If I can reproduce the problem then I can verify if it is fixed on trunk.
Cheers,
Alan.
---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project: http://qpid.apache.org
Use/Interact: mailto:[email protected]