Well, I was planning to make a general "going to production, anything I 
should tune in Resin for prime time" post this morning, but it turns out I 
have a crash-related one instead. We had our first outage related (as far 
as I can tell) to Resin. (Our site became publicly available outside of 
our beta group on Tuesday.)

Both of the servers in our cluster stopped responding at the same time 
and java started using 100% of all CPU resources. Upon killing one 
server, the other began responding almost immediately. Restarting the 
dead server resulted in MANY exceptions (all roughly the same):

[09:12:53.567] java.lang.IllegalStateException: Can't yet support data over 64M
[09:12:53.567]  at com.caucho.db.store.Inode.readFragmentAddr(Inode.java:972)
[09:12:53.567]  at com.caucho.db.store.Inode.remove(Inode.java:832)
[09:12:53.567]  at com.caucho.db.store.Transaction.writeData(Transaction.java:568)
[09:12:53.567]  at com.caucho.db.sql.QueryContext.unlock(QueryContext.java:517)
[09:12:53.567]  at com.caucho.db.sql.RowIterateExpr.nextBlock(RowIterateExpr.java:86)
[09:12:53.567]  at com.caucho.db.sql.Query.nextBlock(Query.java:713)
[09:12:53.567]  at com.caucho.db.sql.Query.nextTuple(Query.java:690)
[09:12:53.567]  at com.caucho.db.sql.DeleteQuery.execute(DeleteQuery.java:81)
[09:12:53.567]  at com.caucho.db.jdbc.PreparedStatementImpl.execute(PreparedStatementImpl.java:345)
[09:12:53.567]  at com.caucho.db.jdbc.PreparedStatementImpl.executeUpdate(PreparedStatementImpl.java:313)
[09:12:53.567]  at com.caucho.server.cluster.FileBacking.clearOldObjects(FileBacking.java:260)
[09:12:53.567]  at com.caucho.server.cluster.ClusterStore.clearOldObjects(ClusterStore.java:358)
[09:12:53.567]  at com.caucho.server.cluster.StoreManager.handleAlarm(StoreManager.java:637)
[09:12:53.567]  at com.caucho.server.cluster.StoreManager.start(StoreManager.java:386)
[09:12:53.567]  at com.caucho.server.cluster.ClusterStore.start(ClusterStore.java:196)
[09:12:53.567]  at com.caucho.server.cluster.Cluster.environmentStart(Cluster.java:928)
[09:12:53.567]  at com.caucho.loader.EnvironmentClassLoader.start(EnvironmentClassLoader.java:475)
[09:12:53.567]  at com.caucho.server.cluster.Server.start(Server.java:1149)
[09:12:53.567]  at com.caucho.server.cluster.Cluster.startServer(Cluster.java:719)
[09:12:53.567]  at com.caucho.server.cluster.ClusterServer.startServer(ClusterServer.java:455)
[09:12:53.567]  at com.caucho.server.resin.Resin.start(Resin.java:694)
[09:12:53.567]  at com.caucho.server.resin.Resin.initMain(Resin.java:1114)
[09:12:53.567]  at com.caucho.server.resin.Resin.main(Resin.java:1316)

This exception appeared many times, but everything appears to be working 
again. I found one reference suggesting this is cluster store corruption, 
possibly related to locking issues. Since our fence came down (allowing 
public access rather than beta-group-only access), spiders have been 
hitting our site pretty hard, which could result in a lot more lock 
contention (several thousand hits on a server in rapid succession). Not 
sure if this might be related.

Any idea what the root cause of this hang up was, or what I can do to 
prevent it in the future?
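
If it happens again, I could try to capture which threads are burning 
CPU before killing the server. Below is a minimal sketch of that idea 
using only the standard java.lang.management API. Nothing in it is 
Resin-specific: the class name BusyThreads and the report() method are 
made up for illustration, and it only sees threads of the JVM it runs 
in, so it would have to be invoked from inside the Resin process (for 
example from a servlet) to be useful.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Resin: lists the live threads of this
// JVM, highest accumulated CPU time first, with a few stack frames each.
final class BusyThreads {

    public static List<String> report(int maxFrames) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = mx.isThreadCpuTimeSupported();

        // Each row is {threadId, cpuNanos}; cpuNanos is -1 when unavailable.
        List<long[]> rows = new ArrayList<long[]>();
        for (long id : mx.getAllThreadIds()) {
            long cpu = cpuSupported ? mx.getThreadCpuTime(id) : -1L;
            rows.add(new long[] { id, cpu });
        }

        // Sort descending by CPU time so the busiest threads come first.
        Collections.sort(rows, new Comparator<long[]>() {
            public int compare(long[] a, long[] b) {
                return a[1] < b[1] ? 1 : (a[1] > b[1] ? -1 : 0);
            }
        });

        List<String> out = new ArrayList<String>();
        for (long[] row : rows) {
            ThreadInfo info = mx.getThreadInfo(row[0], maxFrames);
            if (info == null) {
                continue; // thread exited between the two calls
            }
            StringBuilder sb = new StringBuilder();
            sb.append(info.getThreadName())
              .append(" state=").append(info.getThreadState())
              .append(" cpu=")
              .append(row[1] < 0 ? "n/a" : (row[1] / 1000000L) + "ms");
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("\n    at ").append(frame);
            }
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        for (String entry : report(8)) {
            System.out.println(entry);
        }
    }
}
```

The CPU figure is total time since each thread started, not a current 
rate, so taking two reports a few seconds apart and comparing them 
would show which threads are actually spinning.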

I'm running resin-pro 3.1.3 with a license. I can't upgrade to 3.1.4 or 
3.1.5 due to the Smarty regular expression issue in them (maybe it is 
fixed in the current snapshot?).

Any help/suggestions are much appreciated.

Andrew

