Bugs item #808170, was opened at 2003-09-17 21:19
Message generated for change (Comment added) made by ejort
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=376685&aid=808170&group_id=22866

Category: Clustering
Group: v3.2
Status: Open
Resolution: None
Priority: 5
Submitted By: Adrian Brock (ejort)
Assigned to: Bela Ban (belaban)
Summary: Partition problems 3.2.2RC4

Initial Comment:
JBoss-3.2.2RC4 (current CVS)

Java1.4.2

Redhat 9



I'm seeing problems with nodes in a partition discovering

each other.



Sometimes it works,



Sometimes is doesn't (see attached console.log and

console2.log)

I have cluster logs as well if you need them.

Shutdown blocks on the partition when it fails 

(see threaddumps at end of logs)



Sometimes I get a NullPointerException

03-09-17 19:07:32,545 DEBUG

[org.jboss.ha.framework.interfaces.HAPartition.Def

aultPartition] Get current members

2003-09-17 19:07:32,546 INFO 

[org.jboss.ha.framework.interfaces.HAPartition.Def

aultPartition] Number of cluster members: 2

2003-09-17 19:07:32,556 INFO 

[org.jboss.ha.framework.interfaces.HAPartition.Def

aultPartition] Other members: 1

2003-09-17 19:07:32,566 WARN 

[org.jboss.ha.framework.interfaces.HAPartition.Def

aultPartition] No additional information has been found

in the JavaGroup

address

: make sure you are running with a correct version of

JavaGroups and

that the pr

otocol  you are using supports the 'additionalData'

behaviour

2003-09-17 19:07:32,577 ERROR

[org.jboss.ha.framework.server.ClusterPartition] S

tarting failed

java.lang.NullPointerException

        at

org.jboss.ha.framework.server.HAPartitionImpl.verifyNodeIsUnique(HAPa

rtitionImpl.java:763)

        at

org.jboss.ha.framework.server.HAPartitionImpl.startPartition(HAPartit

ionImpl.java:236)

        at

org.jboss.ha.framework.server.ClusterPartition.startService(ClusterPa

rtition.java:293)

        at

org.jboss.system.ServiceMBeanSupport.start(ServiceMBeanSupport.java:1

92)

        at

sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)

        at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcces



Regards,

Adrian

----------------------------------------------------------------------

>Comment By: Adrian Brock (ejort)
Date: 2003-09-23 19:14

Message:
Logged In: YES 
user_id=9459

Looks like the problem is the config event is temporarily not 

in any queues so waiting for the queues to become empty does

not work.



Thread[DownHandler (STABLE),5,main] Remove internal Queue

(1) messages Event[type=CONFIG,

[EMAIL PROTECTED]

Thread[DownHandler (STABLE),5,main] removeInternal waiting

for remove_mutex

Thread[DownHandler (STABLE),5,main] removeInternal got

remove_mutex

Thread[DownHandler (STABLE),5,main] removeInternal releasing

remove_mutex

Thread[main,5,main] waitUnitEmpty releasing remove_mutex

Thread[main,5,main] flushed down queue Queue (0) messages

Thread[main,5,main] flushing down queue Queue (0) messages

Thread[main,5,main] waitUnitEmpty waiting for remove_mutex

Thread[main,5,main] waitUnitEmpty got remove_mutex

Thread[main,5,main] waitUnitEmpty releasing remove_mutex

Thread[main,5,main] flushed down queue Queue (0) messages

Thread[main,5,main] flushing down queue Queue (0) messages

Thread[main,5,main] waitUnitEmpty waiting for remove_mutex

Thread[main,5,main] waitUnitEmpty got remove_mutex

Thread[main,5,main] waitUnitEmpty releasing remove_mutex

Thread[main,5,main] flushed down queue Queue (0) messages

Thread[main,5,main] flushing down queue Queue (0) messages

Thread[main,5,main] waitUnitEmpty waiting for remove_mutex

Thread[main,5,main] waitUnitEmpty got remove_mutex

Thread[main,5,main] waitUnitEmpty releasing remove_mutex

Thread[main,5,main] flushed down queue Queue (0) messages

Thread[main,5,main] flushing down queue Queue (0) messages

Thread[main,5,main] waitUnitEmpty waiting for remove_mutex

Thread[main,5,main] waitUnitEmpty got remove_mutex

Thread[main,5,main] waitUnitEmpty releasing remove_mutex

Thread[main,5,main] flushed down queue Queue (0) messages

Thread[main,5,main] flushing down queue Queue (0) messages

Thread[main,5,main] waitUnitEmpty waiting for remove_mutex

Thread[main,5,main] waitUnitEmpty got remove_mutex

Thread[main,5,main] waitUnitEmpty releasing remove_mutex

Thread[main,5,main] flushed down queue Queue (0) messages

Thread[main,5,main] flushing down queue Queue (0) messages

Thread[main,5,main] waitUnitEmpty waiting for remove_mutex

Thread[main,5,main] waitUnitEmpty got remove_mutex

Thread[main,5,main] waitUnitEmpty releasing remove_mutex

Thread[main,5,main] flushed down queue Queue (0) messages

Thread[main,5,main] Checking for additional data

Thread[DownHandler (STABLE),5,main] Removed Queue (0)

messages Event[type=CONFIG, [EMAIL PROTECTED]

Thread[DownHandler (STABLE),5,main] adding Queue (0)

messages Event[type=CONFIG, [EMAIL PROTECTED]

Thread[DownHandler (STABLE),5,main] add waiting for mutext

Queue (0) messages Event[type=CONFIG,

[EMAIL PROTECTED]

Thread[DownHandler (STABLE),5,main] notifying addition Queue

(1) messages Event[type=CONFIG,

[EMAIL PROTECTED]

Thread[DownHandler (STABLE),5,main] notified addition Queue

(1) messages Event[type=CONFIG,

[EMAIL PROTECTED]

Thread[DownHandler (STABLE),5,main] remove() waiting for

mutext Queue (0) messages

Thread[DownHandler (NAKACK),5,main] Remove internal Queue

(1) messages Event[type=CONFIG,

[EMAIL PROTECTED]

Thread[DownHandler (NAKACK),5,main] removeInternal waiting

for remove_mutex

Thread[DownHandler (NAKACK),5,main] removeInternal got

remove_mutex

Thread[DownHandler (NAKACK),5,main] removeInternal releasing

remove_mutex

Thread[DownHandler (NAKACK),5,main] Removed Queue (0)

messages Event[type=CONFIG, [EMAIL PROTECTED]



Regards,

Adrian

----------------------------------------------------------------------

Comment By: Adrian Brock (ejort)
Date: 2003-09-23 18:04

Message:
Logged In: YES 
user_id=9459

Correction is does reach UDP. 



Regards,

Adrian

----------------------------------------------------------------------

Comment By: Adrian Brock (ejort)
Date: 2003-09-23 17:59

Message:
Logged In: YES 
user_id=9459

I've librarly added System.out.printlns to the jgroups source.

Attaced is test.txt from a run where one works and the other

does not.

I don't see the Config event reaching UDP when it fails,

it doesn't invoke setAdditionalData.



Regards,

Adrian

----------------------------------------------------------------------

Comment By: Adrian Brock (ejort)
Date: 2003-09-23 16:50

Message:
Logged In: YES 
user_id=9459

Yes, you have to make the same change to AddDataTest.java



Regards,

Adrian

----------------------------------------------------------------------

Comment By: Sacha Labourey (slaboure)
Date: 2003-09-23 16:46

Message:
Logged In: YES 
user_id=95900

Hello Adrian,



What do you mean by "You have to swap the pbcast 

elements "? To make jboss start I had to swap STABLE and 

NAKACK otherwise I get an ordering exception from JG.

----------------------------------------------------------------------

Comment By: Adrian Brock (ejort)
Date: 2003-09-23 16:42

Message:
Logged In: YES 
user_id=9459

Sacha has upgraded to jgroups but I'm still seeing the problem.



Tested with JBoss but also see the problem with the attached

test.

You have to swap the pbcast elements in the attached test

to make it run.



Did Sacha get the correct version?

Do you need anything from me to debug it?



Regards,

Adrian

----------------------------------------------------------------------

Comment By: Bela Ban (belaban)
Date: 2003-09-23 00:46

Message:
Logged In: YES 
user_id=34890

I fixed this in JGroups CVS head. I can backport it to any

JavaGroups version if you want me to.



Otherwise, the change is in ProtocolStack.start() and Queue.



Sacha, Adrian ?

----------------------------------------------------------------------

Comment By: Bela Ban (belaban)
Date: 2003-09-22 18:25

Message:
Logged In: YES 
user_id=34890

I know what the problem is, working on a solution.



A simple quick fix is to add a timeout (say 500ms) after

sending down the CONFIG event, and before connecting.



Bela

----------------------------------------------------------------------

Comment By: Adrian Brock (ejort)
Date: 2003-09-22 13:31

Message:
Logged In: YES 
user_id=9459

It is definitly a race condition.

I add some System.out.printlns to the javagroups source

(the jboss tagged version).

I even got it to fail with the default configuration



<DefaultConfiguration>



checking for additional data Thread[main,5,main]

handling config event [EMAIL PROTECTED]

Thread[DownHandler (UDP),5,main]



-------------------------------------------------------

GMS: address is htimes2:34016

-------------------------------------------------------

address is htimes2:34016





<JBossConfiguration>



UDP(mcast_addr=228.1.2.3;mcast_port=45566;ip_ttl=64;ip_mcast=true;mcast_send_buf_size=150000;mcast_recv_buf_size=80000;ucast_send_buf_size=150000;ucast_recv_buf_size=80000;loopback=false):PING(timeout=2000;num_initial_members=3;up_thread=true;down_thread=true):MERGE2(min_interval=5000;max_interval=10000):FD(shun=true;up_thread=true;down_thread=true;timeout=2500;max_tries=5):VERIFY_SUSPECT(timeout=3000;num_msgs=3;up_thread=true;down_thread=true):pbcast.STABLE(desired_avg_gossip=20000;up_thread=true;down_thread=true):pbcast.NAKACK(gc_lag=50;retransmit_timeout=300,600,1200,2400,4800;up_thread=true;down_thread=true):UNICAST(timeout=5000;window_size=100;min_threshold=10;down_thread=true):FRAG(frag_size=8192;down_thread=true;up_thread=true):pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;shun=true;print_local_addr=true):pbcast.STATE_TRANSFER(up_thread=true;down_thread=true)

checking for additional data Thread[main,5,main]



-------------------------------------------------------

GMS: address is htimes2:34018

-------------------------------------------------------

handling config event [EMAIL PROTECTED]

Thread[DownHandler (UDP),5,main]

address is htimes2:34018



----------------------------------------------------------------------

Comment By: Adrian Brock (ejort)
Date: 2003-09-22 13:01

Message:
Logged In: YES 
user_id=9459

Hi Bela,



Your test works for me, however using the JBoss protocol stack

it fails. See attached test.



Regards,

Adrian

----------------------------------------------------------------------

Comment By: Sacha Labourey (slaboure)
Date: 2003-09-19 06:28

Message:
Logged In: YES 
user_id=95900

Hello Bela,



No, I don't agree, take a look at the code starting at 

ClusterPartition.startService() i.e. line 282:



        java.util.HashMap staticNodeName = new 

java.util.HashMap();

        staticNodeName.put("additional_data", 

this.nodeName.getBytes());

        this.channel.down (new Event(Event.CONFIG, 

staticNodeName));

        

        channel.connect(partitionName);



This is exactly what you do in your sample code. Would it be 

possible that you have a race condition and that the connect 

occurs while the channel.down call is not yet finished? Is it 

handle by a single thread (the calling thread) or is it 

asynchronous?



It seems that the this.channel.down call does NOT SEND the 

event down the stack but INSTEAD queue it and waits for the 

channel.connect call.



Bela, do you queue your Events waiting for the 

channel.connect call or do you process them asynchronously?

----------------------------------------------------------------------

Comment By: Bela Ban (belaban)
Date: 2003-09-18 22:35

Message:
Logged In: YES 
user_id=34890

Wait, Sacha,



the cluster.log shows clearly that you connect to the

channel *first*, the send down the CONFIG event. This is

incorrect.



Bela

----------------------------------------------------------------------

Comment By: Bela Ban (belaban)
Date: 2003-09-18 22:25

Message:
Logged In: YES 
user_id=34890

Sacha,



are you passing a Map as arg of the CONFIG event ? Also, the

key has to be "additional_data", e.g. as used in the code

below (you can set a breakpoint here to see what's going on):



in UDP.handleDownEvent():



   void handleConfigEvent(HashMap map) {

        if(map == null) return;

        if(map.containsKey("additional_data"))

            additional_data=(byte[])map.get("additional_data");

        if(map.containsKey("send_buf_size")) {

           

mcast_send_buf_size=((Integer)map.get("send_buf_size")).intValue();

            ucast_send_buf_size=mcast_send_buf_size;

        }

        if(map.containsKey("recv_buf_size")) {

           

mcast_recv_buf_size=((Integer)map.get("recv_buf_size")).intValue();

            ucast_recv_buf_size=mcast_recv_buf_size;

        }

        setBufferSizes();

    }



The second thing is that when the member is starting, you

should see the address *with* additional data, ie, the

program below prints the output:





public class AddDataTest {



    public static void main(String[] args) {

        try {

            JChannel c=new JChannel();

            Map m=new HashMap();

            m.put("additional_data", "Node #1".getBytes());

            c.down(new Event(Event.CONFIG, m));

            c.connect("bla");

            System.out.println("address is " +

c.getLocalAddress());

        }

        catch(ChannelException e) {

            e.printStackTrace();

        }

    }



}





-------------------------------------------------------

GMS: address is lotusman:50416 (additional data: 7 bytes)

-------------------------------------------------------

address is lotusman:50416 (additional data: 7 bytes)







----------------------------------------------------------------------

Comment By: Adrian Brock (ejort)
Date: 2003-09-18 15:04

Message:
Logged In: YES 
user_id=9459

I still get failures without the default gateway configured.



Attached netstat -an > netstat.txt

----------------------------------------------------------------------

Comment By: Sacha Labourey (slaboure)
Date: 2003-09-18 08:27

Message:
Logged In: YES 
user_id=95900

That is really strange. Bela, any idea on what is happening? 

The thing is that:

 - I create the JChannel

 - send a config event down the stack THAT CONTAINS 

additional data

 - connect the JChannel

 - ask the JChannel for my node name and the view



and my node name does NOT contain the additional data!?!



Here is the excerpt from the log that shows that:



2003-09-17 21:59:33,277 INFO  

[org.jboss.ha.framework.server.ClusterPartition] Starting





=> SL: Here, I've build my "additional data" using some 

information from the currently running server, we have no 

exception and nothing logged, so everything went fine in 

ClusterPartition.generateUniqueNodeName()



=> SL: next cal is JChannel.connect():





2003-09-17 21:59:33,288 DEBUG 

[org.jboss.ha.framework.server.ClusterPartition] Starting 

ClusterPartition: DefaultPartition

2003-09-17 21:59:33,294 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:33 BST 

2003] [INFO] UDP.start(): creating sockets and starting 

threads

2003-09-17 21:59:33,296 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:33 BST 

2003] [INFO] UDP.createSockets(): unicast sockets will use 

interface 192.168.0.51

2003-09-17 21:59:33,305 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:33 BST 

2003] [INFO] UDP.createSockets(): socket information:

local_addr=htimes2:34107, mcast_addr=228.1.2.3:45566, 

bind_addr=/192.168.0.51, ttl=64

send socket: bound to 192.168.0.51:34106, send buffer 

size=65535

receive socket: bound to 192.168.0.51:34107, receive buffer 

size=65535

multicast socket: bound to 192.168.0.51:45566, send buffer 

size=65535, receive buffer size=65535

2003-09-17 21:59:33,312 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:33 BST 

2003] [INFO] UDP.startThreads(): created unicast receiver 

thread

2003-09-17 21:59:33,431 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:33 BST 

2003] [INFO] FRAG.down(): received CONFIG event: 

[EMAIL PROTECTED]

2003-09-17 21:59:33,432 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:33 BST 

2003] [INFO] NAKACK.down(): received CONFIG event: 

[EMAIL PROTECTED]

2003-09-17 21:59:33,434 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:33 BST 

2003] [INFO] UDP.down(): received CONFIG event: 

[EMAIL PROTECTED]





=> SL: Here we can see that the UDP layer (above) has some 

additional_data received by the CONFIG event: [EMAIL PROTECTED]





2003-09-17 21:59:33,445 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:33 BST 

2003] [INFO] PING.down(): FIND_INITIAL_MBRS

2003-09-17 21:59:35,459 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:35 BST 

2003] [INFO] PING.down(): initial mbrs are []

2003-09-17 21:59:35,461 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:35 BST 

2003] [DEBUG] ClientGmsImpl.join(): initial_mbrs are []

2003-09-17 21:59:35,461 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:35 BST 

2003] [INFO] ClientGmsImpl.join(): no initial members 

discovered: creating group as first member

2003-09-17 21:59:35,463 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:35 BST 

2003] [INFO] GMS.installView(): view is [htimes2:34107|0] 

[htimes2:34107]

2003-09-17 21:59:35,479 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:35 BST 

2003] [INFO] GMS.setImpl(): changed role to 

org.javagroups.protocols.pbcast.CoordGmsImpl

2003-09-17 21:59:35,479 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:35 BST 

2003] [INFO] GMS.becomeCoordinator(): htimes2:34107 

became coordinator

2003-09-17 21:59:35,480 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:35 BST 

2003] [INFO] ClientGmsImpl.becomeSingletonMember(): 

created group (first member). My view is [htimes2:34107|0], 

impl is org.javagroups.protocols.pbcast.CoordGmsImpl

2003-09-17 21:59:35,605 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:35 BST 

2003] [INFO] STABLE.startStableTask(): stable task started; 

num_gossip_runs=3, max_gossip_runs=3

2003-09-17 21:59:35,627 TRACE 

[org.javagroups.DefaultPartition] [Wed Sep 17 21:59:35 BST 

2003] [INFO] MERGE2.FindSubgroups.run(): merge task started

2003-09-17 21:59:35,834 DEBUG 

[org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition

] ViewAccepted: initial members set

2003-09-17 21:59:35,835 INFO  

[org.jboss.ha.framework.server.ClusterPartition] Starting 

channel



=> SL: at this point, the call to JChannel.connect() is finished 

and we start the HAPartition which will call 

JChannel.getLocalAddress() and the recieved JG IPAddress 

object does NOT contain the additional_data: WHY?





2003-09-17 21:59:35,836 DEBUG 

[org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition

] get nodeName

2003-09-17 21:59:35,836 DEBUG 

[org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition

] Get current members

2003-09-17 21:59:35,837 INFO  

[org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition

] Number of cluster members: 1

2003-09-17 21:59:35,837 INFO  

[org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition

] Other members: 0

2003-09-17 21:59:35,837 WARN  

[org.jboss.ha.framework.interfaces.HAPartition.DefaultPartition

] No additional information has been found in the JavaGroup 

address: make sure you are running with a correct version of 

JavaGroups and that the protocol  you are using supports 

the 'additionalData' behaviour





Any idea Bela?

----------------------------------------------------------------------

Comment By: Adrian Brock (ejort)
Date: 2003-09-17 23:51

Message:
Logged In: YES 
user_id=9459

cat /etc/hosts

# Do not remove the following line, or various programs

# that require network functionality will fail.

192.168.0.51    htimes2

127.0.0.1       localhost.localdomain   localhost

127.0.0.1       4pcvs



There are no other instances of jboss on the network.

Sorry can't provide netstat, I'm no longer running the

configuration.



One thing occurs to me (I'm configured with a default gateway)

maybe the gateway is interfering with udp? I'll try it again

without the default gateway setting.



Regards,

Adrian

----------------------------------------------------------------------

Comment By: Scott M Stark (starksm)
Date: 2003-09-17 23:37

Message:
Logged In: YES 
user_id=175228

I have an RH9 system with JDK 1.4.2 that I have tried this

recepie on a few times and don't see the problem. One issue

with the "...  'additionalData'' warning is that if this

occurs you are guarenteed to see an NPE so we should just be

throwing an exception rather than logging a msg and then

proceeding to an NPE.



Do you have another cluster running somewhere else? I'll

have to look into the logs.



----------------------------------------------------------------------

Comment By: Adrian Brock (ejort)
Date: 2003-09-17 21:23

Message:
Logged In: YES 
user_id=9459

To construct my two nodes:



cp -r all all2

edit all/conf/jboss-service.xml (uncomment service-binding)

./run.sh -c all and ./run.sh -c all2



Regards,

Adrian

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=376685&aid=808170&group_id=22866


-------------------------------------------------------
This sf.net email is sponsored by:ThinkGeek
Welcome to geek heaven.
http://thinkgeek.com/sf
_______________________________________________
JBoss-Development mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/jboss-development

Reply via email to