The system:
I have an application in production with peak usage of about 100
concurrent users (dual CPU, 1 GB RAM). The box handles the load like
it ain't no thang, as the majority of the application and number
crunching happens in Oracle. The JRun instance uses an appropriate
amount of CPU and RAM; nothing out of the ordinary as far as load is
concerned.

The problem:
Every 1-3 days the system comes to a screeching halt and eventually
throws the all-too-common JRun connection errors. Neither CF nor
Oracle throws any errors to indicate the source of the problem.
However, the JRun instance log reads:

java.lang.RuntimeException: Request timed out waiting for an available
thread to run. You may want to consider increasing the number of
active threads in the thread pool.
at jrunx.scheduler.ThreadPool$Throttle.enter(ThreadPool.java:125)
at jrunx.scheduler.ThreadPool$ThreadThrottle.invokeRunnable(ThreadPool.java:448)
at jrunx.scheduler.WorkerThread.run(WorkerThread.java:66)

A TechNote on the subject directed me to set the 'threadWaitTimeout'
attribute of the JRun ProxyService to match the 'Timeout Requests'
setting in the CFIDE. Even after doing so, the problem persists.

Steven Erat provided a great entry on TalkingTree about performing a
Java stack trace to try to find clues about the template that may be
causing the infinite thread lock. (Getting a stack trace is a lot
easier than it sounds, especially on a server in production!)

Unfortunately, the stack trace hasn't provided me with any golden
nuggets. I performed a series of dumps at intervals of a couple of
seconds. Since I expected to see thread IDs with the same output over
time, I had hoped to be able to identify a particular CF template.
Instead, this is the dump that all the threads displayed:

"jrpp-4" prio=5 tid=0x3c67f008 nid=0xcd4 runnable [3e95f000..3e95fdbc]
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:353)
- locked <0x13b961f8> (a java.net.PlainSocketImpl)
at java.net.ServerSocket.implAccept(ServerSocket.java:448)
at java.net.ServerSocket.accept(ServerSocket.java:419)
at jrun.servlet.network.NetworkService.accept(NetworkService.java:368)
at jrun.servlet.jrpp.JRunProxyService.accept(JRunProxyService.java:104)
at jrun.servlet.jrpp.JRunProxyService.createRunnable(JRunProxyService.java:120)
at jrunx.scheduler.ThreadPool$ThreadThrottle.createRunnable(ThreadPool.java:377)
at jrunx.scheduler.WorkerThread.run(WorkerThread.java:62)

Can anyone offer any insight as to what this dump is telling me?

Aside from this dump, I believe the problem has something to do with
Oracle. When googling thread locks and CFMX, the word 'Oracle'
appears frequently. It seems like CF makes a request of Oracle, then
sits there waiting for Oracle to respond, ignoring any and all
configured timeouts. As I mentioned above, neither Oracle nor CF
shows any signs of being under increased load; in fact, they have
tons of resources to spare.
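One way I plan to test the Oracle theory is to watch v$session for sessions stuck on a single long call while CF is hung. A minimal sketch follows; the column names come from Oracle's v$session view (last_call_et is seconds since the current call began), and the filtering is split into a plain function so it can be sanity-checked without a live database. How you actually run the SQL (SQL*Plus, a script, whatever) is up to you:

```python
# Active sessions and how long their current call has been running.
# v$session.last_call_et = seconds since the current call started.
LONG_CALL_SQL = """
SELECT sid, serial#, username, status, last_call_et, sql_hash_value
  FROM v$session
 WHERE status = 'ACTIVE' AND username IS NOT NULL
"""

def long_running(rows, threshold_secs=300):
    """Given rows of (sid, serial#, username, status, last_call_et,
    sql_hash_value), return those whose current call has run longer
    than threshold_secs."""
    return [r for r in rows if r[4] >= threshold_secs]
```

If a CF-owned session shows up with last_call_et in the hundreds of seconds while the site is wedged, that would back up the runaway-query theory and the sql_hash_value would identify the statement.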

My best guess is that there is a stored procedure performing a
SELECT * on a large table without a WHERE clause (possibly with CLOB
columns), which would explain why there aren't any errors on either
side of the request. Since I didn't write this application, I don't
have the time to perform a full code review (although IMHO it
desperately needs a complete rewrite...), so I'm trying to at least
identify a template so that I can point the developer in the right
direction.

I've even gone as far as querying the IIS logs to determine which
pages are called leading up to a complete site crash. Unfortunately,
I'm not able to single out any particular templates. What I did
notice is that occasionally, throughout uptime, IIS logs 503 errors,
which is the error it logs when the server is too busy to process a
request. (This error has never been reported by a user, though,
other than when the site is completely down.)
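Here's roughly how I've been slicing the IIS logs, as a small sketch. It assumes W3C extended format with space-delimited fields (date, time, client IP, method, URI, status in that order; adjust the column indexes to match your own #Fields directive line):

```python
def pages_before_503(log_lines, window=20, uri_col=4, status_col=5):
    """Walk W3C-style IIS log lines; whenever a 503 shows up, tally
    the URIs of the preceding `window` requests. Returns (uri, count)
    pairs, most frequent first, as crash-lead-up suspects."""
    counts = {}
    recent = []
    for line in log_lines:
        if line.startswith("#"):
            continue  # skip W3C directive lines like #Fields:
        fields = line.split()
        if len(fields) <= max(uri_col, status_col):
            continue  # malformed / truncated line
        if fields[status_col] == "503":
            for uri in recent[-window:]:
                counts[uri] = counts.get(uri, 0) + 1
        else:
            recent.append(fields[uri_col])
    return sorted(counts.items(), key=lambda kv: -kv[1])
```

A template that keeps surfacing at the top of this list right before the 503 bursts would be a good candidate to hand the developer, even without a full code review.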

Short of solving this problem, I've clustered three instances of the
application, which buys me a little time to restart locked instances
before all three crap out. Actually, now that I think about it, I
could also set up some scheduled tasks to restart a different
instance every night to clear the locked threads. Obviously, this
band-aid isn't a long-term solution.

So anyways, first let me thank you for reading this long post.
Secondly, if anyone has any ideas, please share them.

-Adam
