Thank you for your quick reply!

We have a cluster of 2 nodes, and have killed/started 1 node a couple of times. 
We did the following scenario:

1. Start node 1
2. Start node 2 after node 1 has started completely
3. Once node 2 has started completely, 'kill -9' node 2 (first failover)
4. Start node 2
5. Once node 2 has started completely, 'kill -9' node 2 (second failover)
6. etc.

At the moment we have no JMS connections or sessions at either node, because 
we're testing the failover scenario 'pure' without any additional load.
Please note that the JMS functionality as such works very fine from the client 
perspective in the clustered setup (sending, receiving, reconnecting once a 
node a gone, etc.). Our problem at the moment only concerns the messaging-node 
failover with the mentioned thread-increase.

After killing node2 several times (depending on the amount of available OS 
memory) we get the following exception (stack trace varies, but all are 
OOME-native-thread):


  | 2008-03-26 10:23:03,967 ERROR [org.jgroups] uncaught exception in 
Thread[ServerSocket acceptor thread 
(channel=DefaultPartition-EntityCache),5,JGroups threa
  | ds] (thread group=org.jgroups.util.Util$1[name=JGroups threads,maxpri=10] )
  | java.lang.OutOfMemoryError: unable to create new native thread
  |         at java.lang.Thread.start0(Native Method)
  |         at java.lang.Thread.start(Thread.java:574)
  |         at 
org.jgroups.protocols.FD_SOCK$ServerSocketHandler.run(FD_SOCK.java:1089)
  |         at java.lang.Thread.run(Thread.java:595)
  | 

Here's an excerpt from our log from node 1 from the moment where it has 
started, until when the failover has finished:


  | 2008-03-26 10:01:46,914 INFO  [org.jboss.system.server.Server] JBoss (MX 
MicroKernel) [4.2.2.GA (build: SVNTag=JBoss_4_2_2_GA date=200710221139)] 
Started in 59s:380ms
  | 2008-03-26 10:02:45,989 INFO  [org.jboss.cache.TreeCache] viewAccepted(): 
[192.168.86.201:45958|1] [192.168.86.201:45958, 192.168.86.202:45969]
  | 2008-03-26 10:02:46,041 INFO  
[org.jboss.cache.statetransfer.StateTransferGenerator_140] returning the state 
for tree rooted in /(1024 bytes)
  | 2008-03-26 10:02:53,732 INFO  
[org.jboss.ha.framework.interfaces.HAPartition.lifecycle.devel] New cluster 
view for partition devel (id: 1, delta: 1) : [192.168.86.201:1099, 
192.168.86.202:1099]
  | 2008-03-26 10:02:53,734 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] I am 
(192.168.86.201:1099) received membershipChanged event:
  | 2008-03-26 10:02:53,734 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] Dead 
members: 0 ([])
  | 2008-03-26 10:02:53,734 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] New 
Members : 1 ([192.168.86.202:1099])
  | 2008-03-26 10:02:53,734 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] All 
Members : 2 ([192.168.86.201:1099, 192.168.86.202:1099])
  | 2008-03-26 10:02:56,742 INFO  [org.jboss.cache.TreeCache] viewAccepted(): 
[192.168.86.201:45964|1] [192.168.86.201:45964, 192.168.86.202:45972]
  | 2008-03-26 10:02:59,322 INFO  [org.jboss.cache.TreeCache] viewAccepted(): 
[192.168.86.201:45966|1] [192.168.86.201:45966, 192.168.86.202:45974]
  | 2008-03-26 10:03:35,464 WARN  [org.jgroups.blocks.ConnectionTable] peer 
closed connection, trying to re-send msg
  | 2008-03-26 10:03:35,464 ERROR [org.jgroups.blocks.ConnectionTable] 2nd 
attempt to send data failed too
  | 2008-03-26 10:03:35,465 WARN  [org.jgroups.blocks.ConnectionTable] peer 
closed connection, trying to re-send msg
  | 2008-03-26 10:03:35,465 ERROR [org.jgroups.blocks.ConnectionTable] 2nd 
attempt to send data failed too
  | 2008-03-26 10:03:35,466 WARN  [org.jgroups.blocks.ConnectionTable] peer 
closed connection, trying to re-send msg
  | 2008-03-26 10:03:35,466 ERROR [org.jgroups.blocks.ConnectionTable] 2nd 
attempt to send data failed too
  | 2008-03-26 10:03:35,633 ERROR 
[org.jboss.remoting.transport.socket.SocketClientInvoker] Got marshalling 
exception, exiting
  | java.net.SocketException: end of file
  |         at 
org.jboss.remoting.transport.socket.MicroSocketClientInvoker.transport(MicroSocketClientInvoker.java:604)
  |         at 
org.jboss.remoting.transport.bisocket.BisocketClientInvoker.transport(BisocketClientInvoker.java:418)
  |         at 
org.jboss.remoting.MicroRemoteClientInvoker.invoke(MicroRemoteClientInvoker.java:122)
  |         at 
org.jboss.remoting.ConnectionValidator.doCheckConnection(ConnectionValidator.java:133)
  |         at 
org.jboss.remoting.ConnectionValidator.run(ConnectionValidator.java:308)
  |         at java.util.TimerThread.mainLoop(Timer.java:512)
  |         at java.util.TimerThread.run(Timer.java:462)
  | 2008-03-26 10:03:36,972 INFO  
[org.jboss.ha.framework.interfaces.HAPartition.lifecycle.devel] Suspected 
member: 192.168.86.202:8800
  | 2008-03-26 10:03:37,031 INFO  
[org.jboss.ha.framework.interfaces.HAPartition.lifecycle.devel] New cluster 
view for partition devel (id: 2, delta: -1) : [192.168.86.201:1099]
  | 2008-03-26 10:03:37,032 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] I am 
(192.168.86.201:1099) received membershipChanged event:
  | 2008-03-26 10:03:37,032 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] Dead 
members: 1 ([192.168.86.202:1099])
  | 2008-03-26 10:03:37,032 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] New 
Members : 0 ([])
  | 2008-03-26 10:03:37,032 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] All 
Members : 1 ([192.168.86.201:1099])
  | 2008-03-26 10:03:37,039 INFO  
[org.jboss.messaging.core.impl.postoffice.MessagingPostOffice] JBoss Messaging 
is failing over for failed node 2. If there are many messages to reload this 
may take some time...
  | 2008-03-26 10:03:37,043 INFO  [org.jboss.cache.TreeCache] viewAccepted(): 
[192.168.86.201:45958|2] [192.168.86.201:45958]
  | 2008-03-26 10:03:37,047 INFO  [org.jboss.cache.TreeCache] viewAccepted(): 
[192.168.86.201:45964|2] [192.168.86.201:45964]
  | 2008-03-26 10:03:37,048 INFO  [org.jboss.cache.TreeCache] viewAccepted(): 
[192.168.86.201:45966|2] [192.168.86.201:45966]
  | 2008-03-26 10:03:37,511 WARN  [org.jboss.remoting.LeasePinger] 
LeasePinger[SocketClientInvoker[19985cb, 
bisocket://192.168.86.202:4457](5c4o2e9-na0je7-fe9q826g-1-fe9qb3hb-j)] failed 
to ping to server: Can not get connection to server. Problem establishing 
socket connection for InvokerLocator [bisocket://192.168.86
  | 
.202:4457/?JBM_clientMaxPoolSize=200&clientLeasePeriod=10000&clientSocketClass=org.jboss.jms.client.remoting.ClientSocketWrapper&dataType=jms&
  | 
marshaller=org.jboss.jms.wireformat.JMSWireFormat&numberOfCallRetries=1&numberOfRetries=10&pingFrequency=214748364&pingWindowFactor=10&socket.check_connection=false&timeout=
  | 0&unmarshaller=org.jboss.jms.wireformat.JMSWireFormat]
  | 2008-03-26 10:03:37,985 WARN  [org.jboss.remoting.Client] unable to remove 
remote callback handler: Can not get connection to server. Problem establishing 
socket connection for InvokerLocator 
[bisocket://192.168.86.202:4457/?JBM_clientMaxPoolSize=200&clientLeasePeriod=10000&clientSocketClass=org.jboss.jms.client.r
  | 
emoting.ClientSocketWrapper&dataType=jms&marshaller=org.jboss.jms.wireformat.JMSWireFormat&numberOfCallRetries=1&numberOfRetries=10&pingFrequency=214748364&pingWindowFactor=10&
  | 
socket.check_connection=false&timeout=0&unmarshaller=org.jboss.jms.wireformat.JMSWireFormat]
  | 2008-03-26 10:03:37,985 WARN  [org.jboss.remoting.LeasePinger] 
LeasePinger[SocketClientInvoker[19985cb, 
bisocket://192.168.86.202:4457](5c4o2e9-na0je7-fe9q826g-1-fe9qb3hb-j)] failed 
sending disconnect for client lease for client with session ID 
5c4o2e9-na0je7-fe9q826g-1-fe9qb3ha-h
  | 2008-03-26 10:03:37,986 ERROR [org.jboss.remoting.MicroRemoteClientInvoker] 
error shutting down lease pinger
  | 2008-03-26 10:03:37,990 INFO  
[org.jboss.messaging.core.impl.postoffice.MessagingPostOffice] JBoss Messaging 
failover completed
  | 2008-03-26 10:03:57,382 WARN  
[org.jboss.jms.server.connectionmanager.SimpleConnectionManager] 
ConnectionManager[14790be] cannot look up remoting session ID 
5c4o2e9-lf5156-fe9q9x46-1-fe9qb3cg-e
  | 2008-03-26 10:03:57,382 WARN  
[org.jboss.jms.server.connectionmanager.SimpleConnectionManager] A problem has 
been detected with the connection to remote client 
5c4o2e9-lf5156-fe9q9x46-1-fe9qb3cg-e, jmsClientID=null. It is possible the 
client has exited without closing its connection(s) or the network has failed. 
All connection resources corresponding to that client process will now be 
removed.
  | 

After we start node 2 again, the log on node 1 looks like this:


  | 2008-03-26 10:04:24,333 INFO  [org.jboss.cache.TreeCache] viewAccepted(): 
[192.168.86.201:45958|3] [192.168.86.201:45958, 192.168.86.202:45976]
  | 2008-03-26 10:04:24,370 INFO  
[org.jboss.cache.statetransfer.StateTransferGenerator_140] returning the state 
for tree rooted in /(1024 bytes)
  | 2008-03-26 10:04:31,502 INFO  
[org.jboss.ha.framework.interfaces.HAPartition.lifecycle.devel] New cluster 
view for partition devel (id: 3, delta: 1) : [192.168.86.201:1099, 
192.168.86.202:1099]
  | 2008-03-26 10:04:31,502 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] I am 
(192.168.86.201:1099) received membershipChanged event:
  | 2008-03-26 10:04:31,502 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] Dead 
members: 0 ([])
  | 2008-03-26 10:04:31,502 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] New 
Members : 1 ([192.168.86.202:1099])
  | 2008-03-26 10:04:31,502 INFO  
[org.jboss.ha.framework.server.DistributedReplicantManagerImpl.devel] All 
Members : 2 ([192.168.86.201:1099, 192.168.86.202:1099])
  | 2008-03-26 10:04:34,472 INFO  [org.jboss.cache.TreeCache] viewAccepted(): 
[192.168.86.201:45964|3] [192.168.86.201:45964, 192.168.86.202:45979]
  | 2008-03-26 10:04:36,692 INFO  [org.jboss.cache.TreeCache] viewAccepted(): 
[192.168.86.201:45966|3] [192.168.86.201:45966, 192.168.86.202:45981]
  | 

In the meantime, we've configured all our EQs and DLQs as non-clustered, 
reducing the thread-increase to ~75 new threads per failover instead of ~200. 
This allows us to failover a couple of extra times before the OOME pops up, but 
still doesn't solve the real problem...

We've also tried to decrease the initial thread stack size with the -Xss JVM 
parameter, but this doesn't appear to make a difference on our 2.4 kernel wrt 
the maximum number of threads the JVM can start/keep alive.

View the original post : 
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4138923#4138923

Reply to the post : 
http://www.jboss.com/index.html?module=bb&op=posting&mode=reply&p=4138923
_______________________________________________
jboss-user mailing list
[email protected]
https://lists.jboss.org/mailman/listinfo/jboss-user

Reply via email to