Hi All,
I'm having some problems configuring JGroups failure detection. I'm using JBoss
[Zion] 4.0.5.GA (build: CVSTag=Branch_4_0 date=200610162339) and JGroups
2.2.9 beta2.
My setup consists of 2 nodes, each with 8 interfaces, paired into 4 redundant
groups per node.
              |                               |
        App 1 |                               | App 1
              |                               |
      -----------------               -----------------
App 3 |               |     Mgmt      |               | App 3
------|    NODE 1     |---------------|    NODE 2     |-------
      |               |               |               |
      -----------------               -----------------
              |                               |
              |                               |
              |            Database           |
              ---------------------------------
I've used the -b option to bind all traffic to the management LAN, and tweaked
my application to bind only to the App 1 LAN. What I'm now trying to do is have
the health check / failure detection run explicitly over the Database LAN.
I've edited my cluster-service.xml file, which now looks like this (edited for
brevity):
<Config>
    <UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="45566"
         ip_ttl="${jgroups.mcast.ip_ttl:8}" ip_mcast="true"
         mcast_recv_buf_size="2000000" mcast_send_buf_size="640000"
         ucast_recv_buf_size="2000000" ucast_send_buf_size="640000"
         loopback="false"/>
    <PING timeout="2000" num_initial_members="3"
          up_thread="true" down_thread="true"/>
    <MERGE2 min_interval="10000" max_interval="20000"/>
    <FD_SOCK srv_sock_bind_addr="192.168.104.56"
             down_thread="false" up_thread="false"/>
    <VERIFY_SUSPECT timeout="3000" num_msgs="3"
                    up_thread="true" down_thread="true"/>
    <pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800"
                   max_xmit_size="8192"
                   up_thread="true" down_thread="true"/>
    <UNICAST timeout="300,600,1200,2400,4800" down_thread="true"/>
    <pbcast.STABLE desired_avg_gossip="20000" max_bytes="400000"
                   up_thread="true" down_thread="true"/>
    <FRAG frag_size="8192" down_thread="true" up_thread="true"/>
    <pbcast.GMS join_timeout="5000" join_retry_timeout="2000"
                shun="true" print_local_addr="true"/>
    <pbcast.STATE_TRANSFER up_thread="true" down_thread="true"/>
</Config>
This is the config on Node 2; 192.168.104.56 is this node's VIP for the
database interface.
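In case it helps show what I'm aiming for, here's a cut-down sketch of the
split I have in mind: the transport stays on the management LAN (I believe -b
sets jboss.bind.address, which I could reference explicitly via bind_addr),
while FD_SOCK is pinned to the database VIP. The jgroups.fd_sock.bind_addr
property name is just something I've invented so the same file could be used
on both nodes, with each node passing its own database VIP on the command
line; none of this is tested:

    <!-- Sketch only: jgroups.fd_sock.bind_addr is a made-up property name;
         Node 2 would be started with -Djgroups.fd_sock.bind_addr=192.168.104.56 -->
    <UDP mcast_addr="${jboss.partition.udpGroup:228.1.2.3}" mcast_port="45566"
         bind_addr="${jboss.bind.address:192.168.105.155}"
         ip_ttl="${jgroups.mcast.ip_ttl:8}" ip_mcast="true"
         loopback="false"/>
    <FD_SOCK srv_sock_bind_addr="${jgroups.fd_sock.bind_addr:192.168.104.56}"
             down_thread="false" up_thread="false"/>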
I have removed the standard FD protocol, as it was failing to detect a failure
when the database connection was broken. Based on what I read at
http://www.redhat.com/docs/manuals/jboss/jboss-eap-4.2/doc/Server_Configuration_Guide/Failure_Detection_Protocols-FD.html
("Regular traffic from a node counts as if it is a live. So, the
are-you-alive messages are only sent when there is no regular traffic to the
node for sometime."), I assumed FD must have been seeing the other JGroups
traffic across the management LAN and so counted the node as alive.
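If the better fix is to keep FD and simply stack it with FD_SOCK (which seems
to be the usual recommendation, rather than removing FD outright), I imagine
the relevant part of the stack would look roughly like this. The timeout and
max_tries values are placeholders I haven't tested, and I'm not certain about
the ordering of the two protocols:

    <!-- Untested sketch: FD_SOCK plus FD, values chosen arbitrarily -->
    <FD_SOCK srv_sock_bind_addr="192.168.104.56"
             down_thread="false" up_thread="false"/>
    <FD timeout="10000" max_tries="5" shun="true"
        up_thread="false" down_thread="false"/>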
The problem is that now the second node detects a failure correctly and sets
itself as the master node, but getCurrentView() still returns two nodes:

Fri Jul 20 14:34:17 BST 2007 MasterNode=true 192.168.105.158:1099 192.168.105.155:1099

(192.168.105.158 and 192.168.105.155 are the management interfaces of Nodes 1
and 2 respectively.)
What's interesting is that when I plugged Node 1 back in, Node 2 briefly removed
it from the partition before merging it back in:
Fri Jul 20 14:35:05 BST 2007 MasterNode=true 192.168.105.158:1099 192.168.105.155:1099
Fri Jul 20 14:35:11 BST 2007 MasterNode=true 192.168.105.158:1099 192.168.105.155:1099
Fri Jul 20 14:35:16 BST 2007 MasterNode=true 192.168.105.155:1099
Fri Jul 20 14:35:22 BST 2007 MasterNode=true 192.168.105.155:1099 192.168.105.158:1099
Fri Jul 20 14:35:29 BST 2007 MasterNode=true 192.168.105.155:1099 192.168.105.158:1099
Any advice or assistance would be greatly appreciated!
Kind Regards,
Neil Saunders.