RE: An ill broker brings down the whole cluster

Shan Wang Tue, 03 Nov 2009 06:44:10 -0800

Hi Carl,

Does that mean that if there is an unresponsive broker in the cluster, all the
other brokers have to wait for it? I was under the impression that if a broker
in a cluster is not behaving normally, clients should fail over to another
broker automatically.

In my case, I haven't set a heartbeat in ConnectionSettings explicitly, and the
cluster worked most of the time, so I guess the default heartbeat interval is
fine. If the heartbeat works, then the ill broker must have passed the
"heartbeat check", i.e., it sent out heartbeats normally and may even have
received messages from the producer client normally, but somehow some of its
threads were deadlocked, so it couldn't respond to consumer requests or admin
requests. Apart from the plug-in, is there anything else I can do to remove
unresponsive brokers?
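
A minimal sketch of the distinction drawn above: a transport-level heartbeat
can keep succeeding while requests go unanswered, so a client-side check would
have to count missed end-to-end replies instead. This is plain Python, not Qpid
API; the ProbeTracker name and the threshold of 3 misses are assumptions made
purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Hypothetical helper: tracks end-to-end probe results for one broker."""

    max_misses: int = 3  # assumed threshold, not a Qpid default
    misses: int = 0

    def record(self, replied_in_time: bool) -> None:
        # A timely reply resets the count; a missed reply increments it.
        self.misses = 0 if replied_in_time else self.misses + 1

    @property
    def unresponsive(self) -> bool:
        # True once the broker has missed max_misses probes in a row.
        return self.misses >= self.max_misses


tracker = ProbeTracker(max_misses=3)
for replied in (True, False, False, False):
    tracker.record(replied)
print(tracker.unresponsive)
```

In such a scheme the probe itself would be a real request (for example, a
message sent to a test queue and read back), so it exercises the same code path
that was stuck here.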

Unfortunately the channel error I sent in the original mail is the only error
log I have (the log level was set to warning+). One factor may have contributed
to the problem: we run qpid-tool on both brokers every 5 minutes to collect
stats. Because of the cluster, the two qpid-tool processes are actually
monitoring the same queues. Is it possible that the two qpid-tool processes
conflicted with each other? Is there a known problem with this?

-----Original Message-----
From: Carl Trieloff [mailto:[email protected]]
Sent: 03 November 2009 13:53
To: [email protected]
Cc: [email protected]
Subject: Re: An ill broker brings down the whole cluster

I don't have enough info to comment on the root cause; maybe Alan can,
based on the log snippet. However, there is a plug-in module that can be
run on nodes in a cluster that will remove any stalled node from the
cluster so that the rest of the cluster can continue to operate as normal.

For example, if you SIGSTOP one broker in a cluster, then the rest of
the cluster will continue to run, but AIS will cache for the node that
is stopped. That node must be evicted at some point if it does
not get a SIGCONT after a period of time. The watchdog plugin does this
for you, at which point you can rejoin another node.
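
The eviction behaviour described above can be pictured as a watchdog timer.
The following is only a conceptual sketch in Python, not the Qpid plug-in's
actual code or configuration: the Watchdog class, its method names, and the
10-second interval are assumptions made for illustration. Timestamps are passed
in explicitly so the example is deterministic.

```python
class Watchdog:
    """Hypothetical watchdog timer: the monitored broker must kick() regularly."""

    def __init__(self, interval: float, now: float) -> None:
        self.interval = interval  # seconds allowed between kicks (assumed value)
        self.last_kick = now

    def kick(self, now: float) -> None:
        # Called by a healthy broker to prove it is still making progress.
        self.last_kick = now

    def expired(self, now: float) -> bool:
        # True when the broker has been silent for longer than the interval,
        # which is the point at which it would be removed from the cluster.
        return now - self.last_kick > self.interval


wd = Watchdog(interval=10.0, now=0.0)
wd.kick(now=8.0)              # broker is healthy and kicks in time
print(wd.expired(now=15.0))   # False: only 7s since the last kick
print(wd.expired(now=30.0))   # True: 22s of silence, e.g. after a SIGSTOP
```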

i.e. running the watchdog would have removed the un-responsive broker in
your example below.  The second part is to understand why it was
unresponsive.

Carl.

Shan Wang wrote:
> Hi All,
>
> We have two qpid 0.5 brokers running in cluster mode on two different boxes.
> The cluster works fine in normal cases, i.e., if broker1 is shut down cleanly,
> broker2 will keep on serving clients. But today we found that one broker had
> suddenly stopped responding to all connected clients and admin tools. All
> producer and consumer clients were still connected but failed to consume any
> messages from the queue. The command line admin tool failed with a timeout
> error. The only error message we found is in the log of broker 1, which said
> this:
>
> 2009-oct-31 10:17:49 error 172.27.34.201:9908(READY/error) channel error 
> 157487219 on 172.27.34.201:9908-389(local): transport-busy: Channel 1 already 
> attached to [email protected]
> 64-4e49-9bee-0538532fe261 (qpid/amqp_0_10/SessionHandler.cpp:150) 
> (unresolved: 172.27.34.201:9908 172.27.34.202:13287 )
>
> After restarting only broker 1, everything started to work again. So,
> surprisingly, it seems that when one of the brokers in the cluster suffered a
> problem, the whole cluster just stalled, at least from the consumer's point
> of view (I can't be sure whether the producer was working during the downtime;
> after things were back to normal, the consumer did receive messages sent some
> time ago). The consumer program uses FailoverManager and AsyncSession, and is
> basically not far from the failover example in the qpid developer docs. So can
> anyone please tell me what the above error message means, and whether similar
> problems with the cluster have been seen before?
>
>
> Regards,
> Shan
>
>
>
> ________________________________
> The information contained in this email is strictly confidential and for the 
> use of the addressee only, unless otherwise indicated. If you are not the 
> intended recipient, please do not read, copy, use or disclose to others this 
> message or any attachment. Please also notify the sender by replying to this 
> email or by telephone (+44 (0)20 7896 0011) and then delete the email and any 
> copies of it. Opinions, conclusions (etc.) that do not relate to the official 
> business of this company shall be understood as neither given nor endorsed by 
> it. IG Index Ltd is a company registered in England and Wales under number 
> 01190902. VAT registration number 761 2978 07. Registered Office: Friars 
> House, 157-168 Blackfriars Road, London SE1 8EZ. Authorised and regulated by 
> the Financial Services Authority. FSA Register number 114059.
>
>

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]
