On Mar 27, 2008, at 7:47 AM, Andrew Fritz wrote:

> Well, I was planning to make a general "going to production, anything I
> should tune in resin for prime time" post this morning, but it turns out
> I have a crash-related one instead. We had our first outage related (as
> far as I can tell) to resin. (Our site actually became publicly available
> outside of our beta group on Tuesday.)

> Both of the servers in our cluster stopped responding at the same time
> and java started using 100% of all CPU resources.

In this kind of situation, a thread dump is always helpful.  /resin-admin
has a tab that will give you a thread dump (or you can always use the
JDK's own thread-dump capabilities).
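
As a rough illustration of the JDK route, here is a minimal sketch using the
standard java.lang.management API (ThreadMXBean.dumpAllThreads, available
from Java 6 on).  The class name ThreadDumpSketch is made up for this
example and is not part of Resin.  Note the assumption: it dumps the
threads of the JVM it runs in, so to see Resin's threads it would have to
run inside the server (for example from a servlet or JSP).  From outside
the process, the JDK's jstack <pid> tool or kill -QUIT <pid> on Unix
produces a similar dump.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    /** Returns a text dump of every live thread in the current JVM. */
    public static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Per-thread CPU time is optional; only report it when the JVM offers it.
        boolean cpu = mx.isThreadCpuTimeSupported() && mx.isThreadCpuTimeEnabled();
        StringBuilder out = new StringBuilder();
        // false, false: skip lock details, which not every JVM can report.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState());
            if (cpu) {
                long nanos = mx.getThreadCpuTime(info.getThreadId());
                if (nanos >= 0) { // -1 means the thread has already exited
                    out.append(" cpu=").append(nanos / 1_000_000).append("ms");
                }
            }
            out.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

With java at 100% CPU, taking two or three dumps a few seconds apart and
looking for RUNNABLE threads that sit in the same frames each time (or
whose cpu= figure keeps climbing) is usually the quickest way to spot a
spinning thread.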

> Upon killing one
> server, the other began responding almost immediately.  Restarting the
> dead server resulted in MANY exceptions (all roughly the same):

> [09:12:53.567] java.lang.IllegalStateException: Can't yet support data over 64M
> [09:12:53.567]  at com.caucho.db.store.Inode.readFragmentAddr(Inode.java:972)
> [09:12:53.567]  at com.caucho.db.store.Inode.remove(Inode.java:832)

It looks like the session store was corrupted by the shutdown.  I've  
added a bug report for us to improve our startup validation at 

You can remove the broken session database in
resin/session/srun_<server-id>.db.  When the server restarts, it will
update itself from the other servers in the cluster.
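
A cautious way to do that removal is to rename the file instead of deleting
it, so the corrupted store stays available for a bug report.  The sketch
below is only an illustration: the class name SessionStoreSketch, the
".corrupt" suffix and the server id "a" are made up, the directory is the
relative resin/session path mentioned above, and it assumes the affected
server is stopped while the file is moved.  It uses java.nio.file, so it
needs Java 7 or later.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SessionStoreSketch {
    /**
     * Renames srun_<serverId>.db inside sessionDir to srun_<serverId>.db.corrupt.
     * Returns the new path, or null when there is no such file.
     */
    public static Path moveAside(Path sessionDir, String serverId) throws IOException {
        Path db = sessionDir.resolve("srun_" + serverId + ".db");
        if (!Files.isRegularFile(db)) {
            return null; // nothing to move
        }
        // Hypothetical suffix; any name the server will not pick up works.
        Path backup = db.resolveSibling(db.getFileName() + ".corrupt");
        return Files.move(db, backup, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults: run from the Resin home directory, server id "a".
        Path dir = Paths.get(args.length > 0 ? args[0] : "resin/session");
        String id = args.length > 1 ? args[1] : "a";
        Path moved = moveAside(dir, id);
        System.out.println(moved == null ? "no session store found" : "moved to " + moved);
    }
}
```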

> [09:12:53.567]  at com.caucho.db.store.Transaction.writeData(Transaction.java:568)
> [09:12:53.567]  at com.caucho.db.sql.QueryContext.unlock(QueryContext.java:517)
> [09:12:53.567]  at com.caucho.db.sql.RowIterateExpr.nextBlock(RowIterateExpr.java:86)
> [09:12:53.567]  at com.caucho.db.sql.Query.nextBlock(Query.java:713)
> [09:12:53.567]  at com.caucho.db.sql.Query.nextTuple(Query.java:690)
> [09:12:53.567]  at com.caucho.db.sql.DeleteQuery.execute(DeleteQuery.java:81)
> [09:12:53.567]  at

> This exception appeared many times, but everything appears to be working
> again.  I found one reference suggesting this is cluster store
> corruption, possibly related to locking issues. Since our fence came down
> (allowing public access, vs beta group only access) spiders have been
> hitting our site pretty hard, which could result in a lot more lock
> contention (several thousand hits on a server in rapid succession). Not
> sure if this might be related.
> Any idea what the root cause of this hang was, or what I can do to
> prevent it in the future?

The thread dump would help narrow the issue down.  It's not  
necessarily related to the error messages you saw on startup.

> I'm running resin-pro 3.1.3 with a license. I can't upgrade to 3.1.4 or
> 3.1.5 due to the smarty regular expression issue in them (maybe it is
> fixed in the current snapshot???)

I believe the regexp issue is fixed in the snapshot, but we never
recommend using a snapshot for a production site.

-- Scott

> Any help/suggestions are much appreciated.
> Andrew
> _______________________________________________
> resin-interest mailing list
> resin-interest@caucho.com
> http://maillist.caucho.com/mailman/listinfo/resin-interest
