Well, I was planning to make a general "going to production, anything I should tune in resin for prime time" post this morning, but it turns out I have a crash-related one instead. We had our first outage related (as far as I can tell) to resin. (Our site actually became publicly available outside of our beta group on Tuesday.)

Both of the servers in our cluster stopped responding at the same time, and java started using 100% of all CPU resources. Upon killing one server, the other began responding almost immediately. Restarting the dead server resulted in MANY exceptions (all roughly the same):

[09:12:53.567] java.lang.IllegalStateException: Can't yet support data over 64M
[09:12:53.567]  at com.caucho.db.store.Inode.readFragmentAddr(Inode.java:972)
[09:12:53.567]  at com.caucho.db.store.Inode.remove(Inode.java:832)
[09:12:53.567]  at com.caucho.db.store.Transaction.writeData(Transaction.java:568)
[09:12:53.567]  at com.caucho.db.sql.QueryContext.unlock(QueryContext.java:517)
[09:12:53.567]  at com.caucho.db.sql.RowIterateExpr.nextBlock(RowIterateExpr.java:86)
[09:12:53.567]  at com.caucho.db.sql.Query.nextBlock(Query.java:713)
[09:12:53.567]  at com.caucho.db.sql.Query.nextTuple(Query.java:690)
[09:12:53.567]  at com.caucho.db.sql.DeleteQuery.execute(DeleteQuery.java:81)
[09:12:53.567]  at com.caucho.db.jdbc.PreparedStatementImpl.execute(PreparedStatementImpl.java:345)
[09:12:53.567]  at com.caucho.db.jdbc.PreparedStatementImpl.executeUpdate(PreparedStatementImpl.java:313)
[09:12:53.567]  at com.caucho.server.cluster.FileBacking.clearOldObjects(FileBacking.java:260)
[09:12:53.567]  at com.caucho.server.cluster.ClusterStore.clearOldObjects(ClusterStore.java:358)
[09:12:53.567]  at com.caucho.server.cluster.StoreManager.handleAlarm(StoreManager.java:637)
[09:12:53.567]  at com.caucho.server.cluster.StoreManager.start(StoreManager.java:386)
[09:12:53.567]  at com.caucho.server.cluster.ClusterStore.start(ClusterStore.java:196)
[09:12:53.567]  at com.caucho.server.cluster.Cluster.environmentStart(Cluster.java:928)
[09:12:53.567]  at com.caucho.loader.EnvironmentClassLoader.start(EnvironmentClassLoader.java:475)
[09:12:53.567]  at com.caucho.server.cluster.Server.start(Server.java:1149)
[09:12:53.567]  at com.caucho.server.cluster.Cluster.startServer(Cluster.java:719)
[09:12:53.567]  at com.caucho.server.cluster.ClusterServer.startServer(ClusterServer.java:455)
[09:12:53.567]  at com.caucho.server.resin.Resin.start(Resin.java:694)
[09:12:53.567]  at com.caucho.server.resin.Resin.initMain(Resin.java:1114)
[09:12:53.567]  at com.caucho.server.resin.Resin.main(Resin.java:1316)

This exception appeared many times, but everything appears to be working again. I found one reference suggesting that this is cluster store corruption, possibly related to locking issues. Since our fence came down (allowing public access instead of beta-group-only access), spiders have been hitting our site pretty hard, which could result in a lot more lock contention (several thousand hits on a server in rapid succession). Not sure if this might be related.

Any idea what the root cause of this hang was, or what I can do to prevent it in the future? I'm running resin-pro 3.1.3 with a license. I can't upgrade to 3.1.4 or 3.1.5 due to the smarty regular expression issue in them (maybe it is fixed in the current snapshot?).

Any help/suggestions are much appreciated.

Andrew

_______________________________________________
resin-interest mailing list
resin-interest@caucho.com
http://maillist.caucho.com/mailman/listinfo/resin-interest