Hello all,
I am having intermittent trouble with JMS provider failover.  I am setting up a 
number of JBoss instances (4.0.1sp1), each running the 'all' configuration, on 
separate boxes.  They are each being deployed under the default partition.  My 
goal is to have a uniform server configuration and provide JMS failover.
I have been running a failover test in production, where we have 50 
boxes/nodes. I bring the boxes down in groups of 10-20 and check that the 
master node switches correctly and that each instance is able to rejoin the 
cluster and successfully redeploy each of its MDBs.
*Sometimes* all the boxes go down and come back fine. (I see a bunch of errors 
about not being able to contact the JMS provider once the master node goes 
down, but I understand that these are to be changed to debug level. The errors 
go away once the next node in line becomes the coordinator.) Other times I 
continually get errors such as:
org.jboss.ejb.plugins.jms.DLQHandler - Initialization failed DLQHandler
javax.jms.JMSException: Error creating the dlq connection: XAConnectionFactory 
not bound
Then, for each of my MDBs, I get:
2006-06-20 10:37:29,556 ERROR- [JMSContainerInvoker(MyMDB) Reconnect] 
org.jboss.ejb.plugins.jms.DLQHandler - Initialization failed DLQHandler
javax.jms.JMSException: Error creating the dlq connection: XAConnectionFactory 
not bound
2006-06-20 10:37:29,556 WARN - [JMSContainerInvoker(MyMDB) Reconnect] 
org.jboss.ejb.plugins.jms.JMSContainerInvoker - JMS provider failure detected:
javax.jms.JMSException: Error creating the dlq connection: XAConnectionFactory 
not bound

Throughout the startup process I also see these scattered through 
server.log:
ERROR- [UpHandler (GMS)] org.jgroups.protocols.pbcast.ClientGmsImpl - suspect() 
should not be invoked on an instance of 
org.jgroups.protocols.pbcast.ClientGmsImpl
and 
WARN - [DownHandler (GMS)] org.jgroups.protocols.pbcast.ClientGmsImpl - 
handleJoin(fsf-pw148:49881 (additional data: 16 bytes)) failed, retrying
and
WARN - [UpHandler (NAKACK)] org.jgroups.protocols.pbcast.NAKACK - [xxxx:49881 
(additional data: 16 bytes)] discarded message from non-member yyyy:37995 
(additional data: 16 bytes)

In these cases it seems that the next node in line does not recognize that it 
is supposed to become the coordinator (I never see log output showing that it 
is deploying the destinations).  I can sometimes remedy this by successively 
taking down the node that was supposed to become the coordinator until one 
finally does.

I have applied the fix to hajndi-jms-ds.xml to avoid looking for 
XAConnectionFactory in the local JVM (removing the java: prefix). I have also 
moved off of hsqldb as the JMS datasource.  Could this be related to the number 
of nodes 
in the cluster?  Would it help to switch from FD to FD_SOCK failure detection?  
If so, would it still provide me with reliable JMS failover?
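
For reference, the hajndi-jms-ds.xml change mentioned above looks roughly like 
the fragment below. This is only a sketch: the attribute names 
(QueueFactoryRef, TopicFactoryRef) and the exact "before" value are written 
from memory of the stock JMSProviderLoader mbean and are assumptions that may 
not match every 4.0.x release, so please check them against your own file. The 
rest of the mbean is omitted.

```xml
<!-- Fragment of hajndi-jms-ds.xml (JMSProviderLoader mbean); other attributes omitted. -->
<!-- Assumed "before" value: the java: prefix resolves only in the local JVM's JNDI. -->
<!--   <attribute name="QueueFactoryRef">java:/XAConnectionFactory</attribute> -->

<!-- After: without the prefix, the name is resolved through HA-JNDI. -->
<attribute name="QueueFactoryRef">XAConnectionFactory</attribute>
<attribute name="TopicFactoryRef">XAConnectionFactory</attribute>
```

With this in place, the "XAConnectionFactory not bound" errors should only 
appear while no node in the partition has the JMS singleton deployed, which is 
the situation I seem to be stuck in when the failover does not happen.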

Any insight would be greatly appreciated.  If any additional info is needed, 
please let me know.

