On Mar 27, 2008, at 7:47 AM, Andrew Fritz wrote:

> Well, I was planning to make a general "going to production, anything I
> should tune in resin for prime time" post this morning, but it turns out
> I have a crash-related one. We had our first outage related (as far as I
> can tell) to resin. (Our site actually became available publicly outside
> of our beta group on Tuesday.)
>
> Both of the servers in our cluster stopped responding at the same time
> and java started using 100% of all CPU resources.

In this kind of situation, a thread dump is always helpful. There's a
tab in /resin-admin which will give you a thread dump (or you can
always use the JDK's capabilities).

> Upon killing one server, the other began responding almost immediately.
> Restarting the dead server resulted in MANY exceptions (all roughly the
> same):
> [09:12:53.567] java.lang.IllegalStateException: Can't yet support data over 64M
> [09:12:53.567]     at com.caucho.db.store.Inode.readFragmentAddr(Inode.java:972)
> [09:12:53.567]     at com.caucho.db.store.Inode.remove(Inode.java:832)

It looks like the session store was corrupted by the shutdown. I've
added a bug report for us to improve our startup validation at
http://bugs.caucho.com/view.php?id=2558 .

You can remove the broken session database in
resin/session/srun_<server-id>.db. When the server restarts, it will
update itself from the other servers in the cluster.

> [09:12:53.567]     at com.caucho.db.store.Transaction.writeData(Transaction.java:568)
> [09:12:53.567]     at com.caucho.db.sql.QueryContext.unlock(QueryContext.java:517)
> [09:12:53.567]     at com.caucho.db.sql.RowIterateExpr.nextBlock(RowIterateExpr.java:86)
> [09:12:53.567]     at com.caucho.db.sql.Query.nextBlock(Query.java:713)
> [09:12:53.567]     at com.caucho.db.sql.Query.nextTuple(Query.java:690)
> [09:12:53.567]     at com.caucho.db.sql.DeleteQuery.execute(DeleteQuery.java:81)
> [09:12:53.567]     at
>
> This exception appeared many times, but everything appears to be working
> again. I found one reference related to this being cluster store
> corruption possibly related to locking issues. Since our fence came down
> (allowing public access, vs beta group only access) spiders have been
> hitting our site pretty hard which could result in a lot more lock
> contention (several 1000 hits on a server in rapid succession).
> Not sure if this might be related.
>
> Any idea what the root cause of this hang up was, or what I can do to
> prevent it in the future?

The thread dump would help narrow the issue down. It's not necessarily
related to the error messages you saw on startup.

> I'm running resin-pro 3.1.3 with a license. I can't upgrade to 3.1.4 or
> 3.1.5 due to the smarty regular expression issue in them (maybe it is
> fixed in the current snapshot???)

I believe the regexp issue is fixed in the snapshot, but we do not ever
recommend using the snapshot for a production site.

-- Scott

> Any help/suggestions are much appreciated.
>
> Andrew
>
> _______________________________________________
> resin-interest mailing list
> resin-interest@caucho.com
> http://maillist.caucho.com/mailman/listinfo/resin-interest
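
For the thread dump requested above, the JDK side usually means running
`jstack <pid>` or sending `kill -QUIT <pid>` to the java process. A dump
can also be taken in-process through the standard java.lang.management
API. The following is only a minimal sketch: it assumes Java 6 or later,
it must run inside the JVM being inspected (for example from a small
diagnostic servlet), and the class name ThreadDumpSketch is made up for
illustration and is not part of Resin.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not a Resin class: dumps the threads of the JVM it runs in.
public class ThreadDumpSketch {

    /** Returns a plain-text dump of every live thread with its full stack. */
    public static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip locked-monitor and synchronizer details,
        // which are optional features that not every JVM supports.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

With java pinned at 100% CPU, threads that stay RUNNABLE in the same
frames across two or three dumps taken a few seconds apart are the
usual place to start looking. An in-process dump may not respond if the
server is fully hung, in which case the external tools are the safer
choice.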