I was able to recreate the problem on our cluster. We use the 
DistributedLockManager from JGroups version 2.2.8. Simply putting it under 
stress testing let me recreate the problem consistently.
The root of the problem is that JGroups fails to receive heartbeat messages from 
the member under stress and removes that member's channel from the group. The 
channel on the stressed node is then closed and unusable for sending or receiving 
messages.
To prevent the problem from happening I use the following parameters, and it 
works out OK:

FD ( <FD timeout="5000" max_tries="4" shun="true"/> ) 
.................
 <pbcast.GMS ....... shun="false"/>
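
To get a feel for what these FD values mean, here is a rough sketch in Java. It 
is only a simplified model of a heartbeat failure detector, not the JGroups FD 
code itself, and it assumes a member is suspected after roughly 
timeout * max_tries milliseconds of silence. The class and method names are 
made up for illustration.

```java
// Simplified model of a heartbeat failure detector.
// This is NOT the JGroups FD implementation; FdWindow and its methods
// are hypothetical names used only to illustrate the arithmetic.
public class FdWindow {

    /** Approximate silence (ms) tolerated before a member is suspected. */
    static long detectionWindowMillis(long timeoutMillis, int maxTries) {
        return timeoutMillis * maxTries;
    }

    /**
     * Returns true if a member that stays silent for silentMillis would be
     * suspected, counting one missed heartbeat per elapsed timeout period.
     */
    static boolean wouldBeSuspected(long silentMillis, long timeoutMillis, int maxTries) {
        int missed = 0;
        for (long elapsed = timeoutMillis; elapsed <= silentMillis; elapsed += timeoutMillis) {
            missed++;
            if (missed >= maxTries) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Values taken from the FD settings above: timeout=5000, max_tries=4.
        System.out.println(detectionWindowMillis(5000, 4));   // prints 20000
        System.out.println(wouldBeSuspected(12000, 5000, 4)); // prints false
        System.out.println(wouldBeSuspected(20000, 5000, 4)); // prints true
    }
}
```

Under this model a node that stalls for 12 seconds during a stress test stays 
in the group, while one that stalls for 20 seconds or more is suspected. The 
exact behaviour in a given JGroups version may differ somewhat.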


Since we have access to the channel creation code, we also set the channel 
option AUTO_RECONNECT after creating it. I don't know whether that is an 
option in your situation.

channel.setOpt(Channel.AUTO_RECONNECT, Boolean.TRUE);

This allows the channel to restart and function again if the problem does happen.

Hope it provides some help.

HN

View the original post : 
http://www.jboss.com/index.html?module=bb&op=viewtopic&p=3902667#3902667



_______________________________________________
JBoss-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/jboss-user
