I increased the file descriptor limit to be safe, but that doesn't appear to
be the issue.  The user was nowhere near the file descriptor limit in any of
my checks, and I doubt there could be a sudden file descriptor spike that
would open another 5000 files.

One interesting thing: I have been attaching jstat to the process to see
what the heap looks like right before the crash.  The heap data looks fine,
but the timestamp stops incrementing for about 10-12 consecutive samples.
jstat gets disconnected when the process dies, but this data makes it look
like the process is running but unresponsive for about 10 seconds and is
then killed.  Would the wrapper process kill the JVM if it found it running
but unresponsive?  Is there anything else in Resin that would kill the Java
process if it detected a deadlock or something similar (we are not using
the <ping> check in resin.conf)?
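
One thing I'm considering but have not deployed yet: registering a shutdown
hook when the app starts, so we can tell how the JVM is going down.  Shutdown
hooks run for System.exit() and for SIGTERM, but not for kill -9 or a hard
JVM crash, so whether the hook logs anything at all would narrow things down.
A rough sketch (the class name and install point are just placeholders):

// Sketch only: register this once at startup (for example from a
// ServletContextListener).  If this message never appears in stderr before
// a restart, the process was most likely killed with SIGKILL (kill -9) or
// died hard, since shutdown hooks do not run in those cases.
public class ShutdownLogger {
    public static void install() {
        Runtime.getRuntime().addShutdownHook(new Thread("shutdown-logger") {
            public void run() {
                System.err.println("JVM shutting down cooperatively at "
                                   + new java.util.Date());
            }
        });
    }
}

Here is the jstat output leading up to one of the crashes: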

Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC     FGCT      GCT
        40044.1   0.00   0.00  23.05  83.21  82.56    711   70.353   118  104.280  174.632
        40045.1   0.00   0.00  26.98  83.21  82.56    711   70.353   118  104.280  174.632
        40046.1   0.00   0.00  28.36  83.21  82.56    711   70.353   118  104.280  174.632
        40047.1   0.00   0.00  29.88  83.21  82.56    711   70.353   118  104.280  174.632
        40048.1   0.00   0.00  31.28  83.21  82.56    711   70.353   118  104.280  174.632
        40049.1   0.00   0.00  32.86  83.21  82.56    711   70.353   118  104.280  174.632
        40050.1   0.00   0.00  35.72  83.21  82.56    711   70.353   118  104.280  174.632
        40050.7   0.00   0.00  41.97  83.21  82.56    711   70.353   118  104.280  174.632
        40050.7   0.00   0.00  41.97  83.21  82.56    711   70.353   118  104.280  174.632
        40050.7   0.00   0.00  41.97  83.21  82.56    711   70.353   118  104.280  174.632
        40050.7   0.00   0.00  41.97  83.21  82.56    711   70.353   118  104.280  174.632
        40050.7   0.00   0.00  41.97  83.21  82.56    711   70.353   118  104.280  174.632
        40050.7   0.00   0.00  41.97  83.21  82.56    711   70.353   118  104.280  174.632
        40050.7   0.00   0.00  41.97  83.21  82.56    711   70.353   118  104.280  174.632
        40050.7   0.00   0.00  41.97  83.21  82.56    711   70.353   118  104.280  174.632
        40050.7   0.00   0.00  41.97  83.21  82.56    711   70.353   118  104.280  174.632
        40050.7   0.00   0.00  41.97  83.21  82.56    711   70.353   118  104.280  174.632
        40050.7   0.00   0.00  41.97  83.21  82.56    711   70.353   118  104.280  174.632
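
Also, related to Scott's earlier suggestion that something may be calling
System.exit or System.halt: one thing we could try is installing a
SecurityManager that logs the caller before letting the exit proceed.  Both
System.exit() and Runtime.halt() go through checkExit(), so if Resin or our
own code is exiting deliberately, the culprit should show up in stderr.
This is only a sketch, assuming we can install it early in startup (the
class name is a placeholder):

import java.security.Permission;

// Sketch only: log the stack of whatever calls System.exit()/Runtime.halt(),
// then let the exit continue normally.
public class ExitTracer extends SecurityManager {
    public void checkExit(int status) {
        new Throwable("checkExit(" + status + ") called by:").printStackTrace();
        // Intentionally no SecurityException, so the shutdown still proceeds.
    }

    // Leave all other checks permissive so the rest of the app is unaffected.
    public void checkPermission(Permission perm) { }
    public void checkPermission(Permission perm, Object context) { }

    public static void install() {
        System.setSecurityManager(new ExitTracer());
    }
}

If nothing ever shows up there and the process still disappears, that would
point back toward an external kill or a hard JVM death.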

On Thu, Feb 17, 2011 at 2:06 PM, Scott Ferguson <f...@caucho.com> wrote:

> Shane Cruz wrote:
> > So, with full debug logging turned on, I did see this exception in the
> > logs right before the restart:
> >
> > [13:55:37.603] com.caucho.log.EnvironmentLogger.log
> > com.caucho.config.ConfigException: OpenSSL can't open
> > certificate-chain-file '/nfs/certs/mysite.crt'
> > [13:55:37.603]  at com.caucho.vfs.OpenSSLFactory.open(Native Method)
> > [13:55:37.603]  at
> > com.caucho.vfs.OpenSSLFactory.accept(OpenSSLFactory.java:419)
> > [13:55:37.603]  at com.caucho.server.port.Port.accept(Port.java:813)
> > [13:55:37.603]  at
> > com.caucho.server.port.TcpConnection.run(TcpConnection.java:495)
> > [13:55:37.603]  at
> > com.caucho.util.ThreadPool.runTasks(ThreadPool.java:520)
> > [13:55:37.603]  at com.caucho.util.ThreadPool.run(ThreadPool.java:442)
> > [13:55:37.603]  at java.lang.Thread.run(Thread.java:619)
> > [13:55:37.603]
> > [13:55:49.109] com.caucho.log.EnvironmentLogger.log Server[myserver1]
> > starting
> >
> > That certificate is getting loaded over NFS. Is there a chance that a
> > certificate loading failure due to an NFS issue could cause the JVM to
> > exit?  I thought the certificate would just be loaded one time at
> > startup, but it looks like maybe it accesses it during runtime as well.
>
> Possibly an issue running out of file descriptors?
>
> That exception shouldn't cause a restart directly. It would cause that
> thread to exit, but would also start up a new thread to listen to that
> port (because it's assuming the current thread is broken for some reason.)
>
> But you could get a "can't open" if you run out of file descriptors, and
> running out of  file descriptors can force a restart.
>
> -- Scott
> >
> > Unfortunately, on a different JVM, there was a crash that doesn't seem
> > to have the same exception:
> >
> > [13:36:03.102] com.caucho.log.EnvironmentLogger.log allocate
> > PoolItem[jdbc/db1,3340053,com.caucho.sql.ManagedConnectionImpl@744ab820]
> > [13:36:03.102] com.caucho.log.EnvironmentLogger.log allocate
> > PoolItem[jdbc/db2,1020267,com.caucho.sql.ManagedConnectionImpl@2a121a07]
> > [13:36:16.815] com.caucho.log.EnvironmentLogger.log Server[myserver2]
> > starting
> >
> > Scott, what are your thoughts on the certificate issue?  To be safe,
> > we should probably start by not loading the certificate over an NFS
> > share.
> >
> > Thanks,
> > Shane
> >
> > On Fri, Feb 11, 2011 at 1:40 PM, Scott Ferguson <f...@caucho.com
> > <mailto:f...@caucho.com>> wrote:
> >
> >     Shane Cruz wrote:
> >     > We are running Resin Pro 3.0.25 on RHEL 5.5 and using 64-bit Sun
> >     > JDK 1.6.0_05.  Recently, we have started seeing several incidents
> >     > where the Resin JVM seems to just randomly get restarted.  There is
> >     > nothing in the logs to indicate that the JVM was shutdown cleanly or
> >     > a restart was attempted, the log files just go from displaying
> >     > regular log lines to displaying the following:
> >     The logging for 4.0 is much more informative. With 3.0 it's a bit
> >     trickier.
> >     >
> >     > [11:24:18.095] com.caucho.log.EnvironmentLogger.log
> >     > Server[myserver] starting
> >     >
> >     > Things that have already been checked:
> >     >
> >     > 1. There doesn’t appear to be a JVM crash as no HotSpot Error log
> >     > files are created as they usually would be.
> >     >
> >     > 2. There are no signs in the sudo logs that anyone is manually
> >     > restarting the JVM.
> >     >
> >     > 3. There are no signs in the logs that Resin is restarting itself
> >     > even though we have a “min-free-memory” setting of 1M.  With higher
> >     > values of that setting we have seen the JVM get restarted due to
> >     > low memory, but I am pretty sure logging always indicated that the
> >     > JVM was restarting when this happened before.
> >     >
> >     > 4. We are not using the resin “ping” check that might restart the
> >     > JVM if it is unresponsive.
> >     >
> >     > 5. Kernel logging is enabled and it doesn't look like the kernel
> >     > is killing it for any reason
> >     >
> >     > It almost seems as if the JVM is just getting a kill -9 and then
> >     > the wrapper script is starting it back up.  What is the best way
> >     > to track down what might be killing the JVM?  We are in the process
> >     > of testing an upgrade to a newer version of the JDK, but I am not
> >     > very confident that will fix the problem.  I am going to try to
> >     > turn on full Resin debug logging, but I thought I would reach out
> >     > in case anyone else had an idea of how to track this down.  Is
> >     > there a way to wrap the Linux kill command to find out if that is
> >     > being run?  Any other suggestions on where to look?
> >     Since a phantom kill is pretty unlikely, I wouldn't spend too much
> >     time
> >     on that theory.
> >
> >     Since you're not getting a hs_* error, the most likely would be
> >     either something calling System.exit or System.halt, possibly Resin
> >     itself for
> >     something like running out of threads or memory (although, as you
> >     pointed out, that should be logged.)
> >
> >     Other than that, the restart should only happen if the config files
> >     change (theoretically something like NFS or 'touch' could trigger
> >     that,
> >     but I assume that's not happening.)
> >
> >     -- Scott
_______________________________________________
resin-interest mailing list
resin-interest@caucho.com
http://maillist.caucho.com/mailman/listinfo/resin-interest
