I thought I'd update you all on the problems we have been having with our production deployment. We finally rolled out an update that include proper pool sizes for the components and the fix to the Castor mapping file. However, before we did that our system engineer provided some valuable insight. He provided a graph that shows one CPU going into a hard loop. About 7 minutes later the system became completely congested and ran out of threads. The pool and mapping file changes helped in that when the first failure after the changes occured it took about 30 minutes for the system to run out of threads.

However, that information caused me to go back and look at my stack traces. It turns out that everyone single one (with the exception noted below) showed one thread doing the same thing. Now, we first deployed this product in March of 2005 and experienced no failures until a product update was released in August. That update included some new calculators written in flowscript. This version of Cocoon is using rhino1.5r4-continuations-20040629T1232.jar. The stack traces indicate that these are going into a loop and causing the system to die. At first we thought that the calculators were not doing proper input validation and causing the wierd things to happen. The stack traces kind of supported this in that they look like:

"http-8080-Processor18" daemon prio=1 tid=0x30e38df8 nid=0x51a8 runnable [2d351000..2d35387c]
   at java.lang.Class.isPrimitive(Native Method)
at org.mozilla.javascript.NativeJavaObject.getConversionWeight(NativeJavaObject.java:324) at org.mozilla.javascript.NativeJavaObject.canConvert(NativeJavaObject.java:259) at org.mozilla.javascript.NativeJavaMethod.findFunction(NativeJavaMethod.java:356) at org.mozilla.javascript.NativeJavaMethod.call(NativeJavaMethod.java:193)
   at org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1244)
at org.mozilla.javascript.continuations.ContinuationInterpreter.interpret(ContinuationInterpreter.java:1134) at org.mozilla.javascript.continuations.ContinuationInterpreter.interpret(ContinuationInterpreter.java:190) at org.mozilla.javascript.continuations.ContinuationInterpreter.interpret(ContinuationInterpreter.java:138) at org.mozilla.javascript.continuations.InterpretedFunctionImpl.call(InterpretedFunctionImpl.java:121)
   at org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1244)
at org.mozilla.javascript.ScriptableObject.callMethod(ScriptableObject.java:1591) at org.apache.cocoon.components.flow.javascript.fom.FOM_JavaScriptInterpreter.handleContinuation(FOM_JavaScriptInterpreter.java:812) - locked <0x66005778> (a org.apache.cocoon.components.flow.javascript.fom.FOM_JavaScriptInterpreter$ThreadScope) at org.apache.cocoon.components.treeprocessor.sitemap.CallFunctionNode.invoke(CallFunctionNode.java:123)

However, we have been able to recreate the loop without entering bad data. In addition, we got a trace that was close to the start of the loop and it is somewhat different. It seems to imply that there is something wrong with Continuation handling, but I have no idea. Two traces taken a few minutes later both looked like the one above.

   at org.mozilla.javascript.Interpreter.doubleWrap(Interpreter.java:2491)
at org.mozilla.javascript.continuations.ContinuationInterpreter.interpret(ContinuationInterpreter.java:657) at org.mozilla.javascript.continuations.ContinuationInterpreter.interpret(ContinuationInterpreter.java:190) at org.mozilla.javascript.continuations.ContinuationInterpreter.interpret(ContinuationInterpreter.java:138) at org.mozilla.javascript.continuations.InterpretedFunctionImpl.call(InterpretedFunctionImpl.java:121)
   at org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1244)
at org.mozilla.javascript.ScriptableObject.callMethod(ScriptableObject.java:1591) at org.apache.cocoon.components.flow.javascript.fom.FOM_JavaScriptInterpreter.handleContinuation(FOM_JavaScriptInterpreter.java:812) - locked <0x66005880> (a org.apache.cocoon.components.flow.javascript.fom.FOM_JavaScriptInterpreter$ThreadScope) at org.apache.cocoon.components.treeprocessor.sitemap.CallFunctionNode.invoke(CallFunctionNode.java:123)

We have a way to get around this problem by replacing the flowscript calculators with CGIs for the time being. However, we will want to do something about this problem in the future. One difficulty in debugging this problem though is that we have no idea which calculators are running or where they are at the time of the failure because interpreted javascript doesn't show up in the stack trace. As a consequence I will probably recommend that they be rewritten as JSR-168 portlets instead of using flow - unless someone has a better idea.

Thoughts and comments are welcome.

Ralph