After putting session replication in for CAS 3.3.5 via jBoss we just recently
started to notice odd behavior in our logs. I'll describe the behavior and then
list some ideas we've had and would like some input on these ideas...
Behavior:
In our tomcat logs, we see the following happen somewhat frequently:
Mar 11, 2010 4:08:19 AM org.apache.catalina.cluster.tcp.SimpleTcpCluster
memberDisappeared
INFO: Received member
disappeared:org.apache.catalina.cluster.mcast.McastMember[tcp://XXX.XXX.XXX.XXX:4001,catalina,XXX.XXX.XXX.XXX,4001,
alive=125032780]
Mar 11, 2010 4:08:19 AM org.apache.catalina.cluster.tcp.SimpleTcpCluster
memberAdded
INFO: Replication member
added:org.apache.catalina.cluster.mcast.McastMember[tcp://XXX.XXX.XXX.XXX:4001,catalina,XXX.XXX.XXX.XXX,4001,
alive=125035743]
This has been gonig on for quite some time now, and happens semi-consistently.
Sometimes once every 10-30 minutes and sometimes 6-10 times in succession
within a few seconds. We don't think session replication is technically
failing, but rather "resetting" itself after the mcast fails. We had the
clustering setup in tomcat 5.5.27 to be fairly standard:
<Cluster className="org.apache.catalina.cluster.tcp.SimpleTcpCluster"
managerClassName="org.apache.catalina.cluster.session.DeltaManager"
expireSessionsOnShutdown="false"
useDirtyFlag="true">
<Membership
className="org.apache.catalina.cluster.mcast.McastService"
mcastAddr="X.X.X.X"
mcastPort="45564"
mcastFrequency="500"
mcastDropTime="3000"
mcastTTL="1"/>
<Receiver
className="org.apache.catalina.cluster.tcp.ReplicationListener"
tcpListenAddress="auto"
tcpListenPort="4001"
tcpSelectorTimeout="100"
tcpThreadCount="6"/>
<Sender
className="org.apache.catalina.cluster.tcp.ReplicationTransmitter"
replicationMode="synchronous"/>
<Valve className="org.apache.catalina.cluster.tcp.ReplicationValve"
filter=".*\.gif;.*\.js;.*\.jpg;.*\.png;.*\.htm;.*\.html;.*\.css;.*\.txt;"/>
<Deployer
className="org.apache.catalina.cluster.deploy.FarmWarDeployer"
tempDir="/tmp/war-temp/"
deployDir="/tmp/war-deploy/"
watchDir="/tmp/war-listen/"
watchEnabled="false"/>
<ClusterListener
className="org.apache.catalina.cluster.session.ClusterSessionListener"/>
</Cluster>
Also noticed was another odd set of logs:
Mar 12, 2010 2:36:45 PM org.apache.catalina.cluster.tcp.TcpReplicationThread
sendAck
WARNING: Unable to send ACK back through channel, channel disconnected?:
Connection reset by peer
Mar 12, 2010 2:36:45 PM org.apache.catalina.cluster.tcp.TcpReplicationThread
sendAck
WARNING: Unable to send ACK back through channel, channel disconnected?: Broken
pipe
This would happen maybe once or twice a day. This one is a bit trickier to
figure out the cause of.
The conclusion was that the mcast timeouts were most likely a little low, and
that since there were no other errors... come Monday we'd just up the timeout
values to:
mcastFrequency="2000"
mcastDropTime="10000"
But this morning we also noticed (before changing these timeout values) that
the logging had changed a bit. The "memberDisappeared" & "memberAdded" logs
seemed to slide off and were replaced with another type of log which was a bit
more consistent (happening upwards of 10 minute intervals) This error makes me
somewhat concerned:
Mar 15, 2010 3:59:43 AM org.apache.catalina.cluster.tcp.ReplicationValve
sendReplicationMessage
SEVERE: Unable to perform replication request.
java.util.ConcurrentModificationException
at java.util.ArrayList.writeObject(ArrayList.java:573)
at sun.reflect.GeneratedMethodAccessor71.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:945)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1461)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1509)
at
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1474)
at
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1392)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1150)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:326)
at
org.apache.catalina.cluster.session.DeltaRequest$AttributeInfo.writeExternal(DeltaRequest.java:352)
at
org.apache.catalina.cluster.session.DeltaRequest.writeExternal(DeltaRequest.java:269)
at
org.apache.catalina.cluster.session.DeltaManager.unloadDeltaRequest(DeltaManager.java:743)
at
org.apache.catalina.cluster.session.DeltaManager.requestCompleted(DeltaManager.java:1320)
at
org.apache.catalina.cluster.tcp.ReplicationValve.send(ReplicationValve.java:549)
at
org.apache.catalina.cluster.tcp.ReplicationValve.sendMessage(ReplicationValve.java:536)
at
org.apache.catalina.cluster.tcp.ReplicationValve.sendSessionReplicationMessage(ReplicationValve.java:518)
at
org.apache.catalina.cluster.tcp.ReplicationValve.sendReplicationMessage(ReplicationValve.java:428)
at
org.apache.catalina.cluster.tcp.ReplicationValve.invoke(ReplicationValve.java:362)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at org.apache.jk.server.JkCoyoteHandler.invoke(JkCoyoteHandler.java:200)
at org.apache.jk.common.HandlerRequest.invoke(HandlerRequest.java:283)
at org.apache.jk.common.ChannelSocket.invoke(ChannelSocket.java:773)
at
org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:703)
at
org.apache.jk.common.ChannelSocket$SocketConnection.runIt(ChannelSocket.java:895)
at
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:619)
After this message, we usually get a replication stats message within a minute
later:
ie.
Mar 15, 2010 1:48:47 PM org.apache.catalina.cluster.tcp.DataSender addStats
INFO: Send stats from [XXX.XXX.XXX.XXX:4,001], Nr of bytes sent=52,364,561 over
14,000 = 3,740 bytes/request, processing time 0 msec, avg processing time 0 msec
Which I assume tells us that replication is up and going along fine.
We've upped the mcast timeout values on our servers, but are concerned that
replication is actually failing now. Not sure why this started this morning.
Going to do more network oriented research on this for certain.
Ideas:
I believe we should not encounter further issue with member connect/disconnect
due to the timeout value mod. We are assuming that we do not see similar
logging in our test env due to the lack of traffic on the test subnet vs a much
higher traffic volume in our production subnet, thus possibly affecting
mcast-ing.
Aside from going to tomcat 6.x to achieve static membership (as opposed to
mcast membership) I'm not sure what else could be done to possibly resolve the
"SEVERE: Unable to perform replication request." logs we're getting. I also
haven't been unable to determine the severity of this message either.
Any ideas or information?
Raymond Walker
Software Systems Engineer Sr.
ITS Northern Arizona University
--
You are currently subscribed to [email protected] as:
[email protected]
To unsubscribe, change settings or access archives, see
http://www.ja-sig.org/wiki/display/JSG/cas-user