Hello,

First, and perhaps completely unrelated: Is it normal to see messages such as 
"additional data: 19 bytes" throughout the logs?

Moving on...

I have a two-node cluster of stock 4.0.5GA servers.  After roughly 4 hours of 
operation one node fails with an OutOfMemoryError stemming from 
org.jgroups.protocols.UDP.  Both servers have two eth interfaces, so I set 
bind_addr on the UDP element accordingly, both in cluster-service.xml and in 
the jboss-service.xml inside the tc5-cluster sar.
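
As a rough sketch of what that change amounts to (not a copy of the actual 
config): the address below is a placeholder for whichever interface the cluster 
traffic should use, and every other attribute on the stock UDP element is 
assumed to be left untouched and is omitted here.

```xml
<!-- Sketch only: 192.168.0.10 is a placeholder address, not taken from the
     real setup. The remaining UDP attributes from the stock file are
     omitted here and assumed unchanged. -->
<UDP bind_addr="192.168.0.10" />
```

The same attribute would have to be set in both files, since each one defines 
its own JGroups stack.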

I enabled DEBUG for jgroups, and the output gets pretty messy.  First, node2 
stops acking are-you-alive messages.  Then node1 gets suspected, for no 
apparent reason.  If I understand correctly, node1 is the coord, so node2 can't 
remove it, and node1 will refuse to remove itself from the view.  It may, 
however, opt to leave and rejoin.

Below is an excerpt from the cluster log file from around the time things begin 
to go awry.  Any hints are greatly appreciated.


  | 2007-04-25 15:33:07,237 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32802 (own address=node1:32839)
  | 2007-04-25 15:33:07,269 DEBUG [org.jgroups.protocols.UDP] 
  | sending msgs:
  | node2:32802: 1 msgs
  | 
  | 2007-04-25 15:33:07,284 DEBUG [org.jgroups.protocols.FD] received ack from node2:32802
  | 2007-04-25 15:33:07,316 DEBUG [org.jgroups.protocols.UDP] 
  | sending msgs:
  | node2:32802: 1 msgs
  | 
  | 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32802 (own address=node1:32839)
  | 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node2:32802 (number=0)
  | 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] sending are-you-alive msg to node2:32805 (additional data: 19 bytes) (own address=node1:32842 (additional data: 19 bytes))
  | 2007-04-25 15:34:51,762 DEBUG [org.jgroups.protocols.FD] heartbeat missing from node2:32805 (additional data: 19 bytes) (number=0)
  | 2007-04-25 15:34:51,767 DEBUG [org.jgroups.protocols.FD] [SUSPECT] suspect hdr is [FD: SUSPECT (suspected_mbrs=[node1:32842 (additional data: 19 bytes)], from=node2:32805 (additional data: 19 bytes))]
  | 2007-04-25 15:34:51,767 WARN  [org.jgroups.protocols.FD] I was suspected, but will not remove myself from membership (waiting for EXIT message)
  | 2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.STABLE] stable task started; num_gossip_runs=3, max_gossip_runs=3
  | 2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.CoordGmsImpl] view=[node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]
  | 2007-04-25 15:34:51,768 DEBUG [org.jgroups.protocols.pbcast.GMS] [local_addr=node1:32842 (additional data: 19 bytes)] view is [node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]
  | 2007-04-25 15:34:51,780 WARN  [org.jgroups.protocols.pbcast.GMS] checkSelfInclusion() failed, node1:32842 (additional data: 19 bytes) is not a member of view [node2:32805 (additional data: 19 bytes)|2] [node2:32805 (additional data: 19 bytes)]; discarding view
  | 2007-04-25 15:34:51,781 WARN  [org.jgroups.protocols.pbcast.GMS] I (node1:32842 (additional data: 19 bytes)) am being shunned, will leave and rejoin group (prev_members are [node1:32842 (additional data: 19 bytes) node2:32805 (additional data: 19 bytes) ])
  | 2007-04-25 15:34:51,781 INFO  [org.jgroups.JChannel] received an EXIT event, will leave the channel
  | 2007-04-25 15:34:51,783 INFO  [org.jgroups.JChannel] closing the channel
  | 2007-04-25 15:34:51,786 ERROR [org.jgroups.protocols.UDP] [node1:32842 (additional data: 19 bytes)] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.OutOfMemoryError: heap allocation failed
  |         at java.net.PlainDatagramSocketImpl.receive0(Native Method)
  |         at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
  |         at java.net.DatagramSocket.receive(DatagramSocket.java:724)
  |         at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
  |         at java.lang.Thread.run(Thread.java:799)
  | 
  | 2007-04-25 15:34:51,790 ERROR [org.jgroups.protocols.UDP] [node1:32839] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.OutOfMemoryError: heap allocation failed
  |         at java.net.PlainDatagramSocketImpl.receive0(Native Method)
  |         at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
  |         at java.net.DatagramSocket.receive(DatagramSocket.java:724)
  |         at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
  |         at java.lang.Thread.run(Thread.java:799)
  | 
  | 2007-04-25 15:34:51,795 DEBUG [org.jgroups.protocols.pbcast.NAKACK] contents for node1:32842 (additional data: 19 bytes):
  | 
  | sent_msgs: [6837 - 6890]
  | received_msgs:
  | node2:32805 (additional data: 19 bytes): received_msgs: [], delivered_msgs: [276 - 328]
  | node1:32842 (additional data: 19 bytes): received_msgs: [], delivered_msgs: [6838 - 6890]
  | 
  | 2007-04-25 15:34:51,796 DEBUG [org.jgroups.protocols.FD_SOCK] socket to node2:32805 (additional data: 19 bytes) was reset
  | 2007-04-25 15:34:51,796 DEBUG [org.jgroups.protocols.FD_SOCK] pinger thread terminated
  | 2007-04-25 15:34:51,825 DEBUG [org.jgroups.protocols.UDP] 
  | sending msgs:
  | node1:32839: 1 msgs
  | 
  | 2007-04-25 15:34:52,092 ERROR [org.jgroups.protocols.UDP] [node1:32842 (additional data: 19 bytes)] exception=java.lang.OutOfMemoryError: heap allocation failed, stack trace=java.lang.OutOfMemoryError: heap allocation failed
  |         at java.net.PlainDatagramSocketImpl.receive0(Native Method)
  |         at java.net.PlainDatagramSocketImpl.receive(PlainDatagramSocketImpl.java:181)
  |         at java.net.DatagramSocket.receive(DatagramSocket.java:724)
  |         at org.jgroups.protocols.UDP$UcastReceiver.run(UDP.java:1264)
  |         at java.lang.Thread.run(Thread.java:799)
  | 

View the original post : 
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=4041136#4041136