I thought I'd update you all on the problems we have been having with
our production deployment. We finally rolled out an update that include
proper pool sizes for the components and the fix to the Castor mapping
file. However, before we did that our system engineer provided some
valuable insight. He provided a graph that shows one CPU going into a
hard loop. About 7 minutes later the system became completely congested
and ran out of threads. The pool and mapping file changes helped in
that when the first failure after the changes occured it took about 30
minutes for the system to run out of threads.
However, that information caused me to go back and look at my stack
traces. It turns out that everyone single one (with the exception noted
below) showed one thread doing the same thing. Now, we first deployed
this product in March of 2005 and experienced no failures until a
product update was released in August. That update included some new
calculators written in flowscript. This version of Cocoon is using
rhino1.5r4-continuations-20040629T1232.jar. The stack traces indicate
that these are going into a loop and causing the system to die. At
first we thought that the calculators were not doing proper input
validation and causing the wierd things to happen. The stack traces kind
of supported this in that they look like:
"http-8080-Processor18" daemon prio=1 tid=0x30e38df8 nid=0x51a8 runnable
[2d351000..2d35387c]
at java.lang.Class.isPrimitive(Native Method)
at
org.mozilla.javascript.NativeJavaObject.getConversionWeight(NativeJavaObject.java:324)
at
org.mozilla.javascript.NativeJavaObject.canConvert(NativeJavaObject.java:259)
at
org.mozilla.javascript.NativeJavaMethod.findFunction(NativeJavaMethod.java:356)
at
org.mozilla.javascript.NativeJavaMethod.call(NativeJavaMethod.java:193)
at org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1244)
at
org.mozilla.javascript.continuations.ContinuationInterpreter.interpret(ContinuationInterpreter.java:1134)
at
org.mozilla.javascript.continuations.ContinuationInterpreter.interpret(ContinuationInterpreter.java:190)
at
org.mozilla.javascript.continuations.ContinuationInterpreter.interpret(ContinuationInterpreter.java:138)
at
org.mozilla.javascript.continuations.InterpretedFunctionImpl.call(InterpretedFunctionImpl.java:121)
at org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1244)
at
org.mozilla.javascript.ScriptableObject.callMethod(ScriptableObject.java:1591)
at
org.apache.cocoon.components.flow.javascript.fom.FOM_JavaScriptInterpreter.handleContinuation(FOM_JavaScriptInterpreter.java:812)
- locked <0x66005778> (a
org.apache.cocoon.components.flow.javascript.fom.FOM_JavaScriptInterpreter$ThreadScope)
at
org.apache.cocoon.components.treeprocessor.sitemap.CallFunctionNode.invoke(CallFunctionNode.java:123)
However, we have been able to recreate the loop without entering bad
data. In addition, we got a trace that was close to the start of the
loop and it is somewhat different. It seems to imply that there is
something wrong with Continuation handling, but I have no idea. Two
traces taken a few minutes later both looked like the one above.
at org.mozilla.javascript.Interpreter.doubleWrap(Interpreter.java:2491)
at
org.mozilla.javascript.continuations.ContinuationInterpreter.interpret(ContinuationInterpreter.java:657)
at
org.mozilla.javascript.continuations.ContinuationInterpreter.interpret(ContinuationInterpreter.java:190)
at
org.mozilla.javascript.continuations.ContinuationInterpreter.interpret(ContinuationInterpreter.java:138)
at
org.mozilla.javascript.continuations.InterpretedFunctionImpl.call(InterpretedFunctionImpl.java:121)
at org.mozilla.javascript.ScriptRuntime.call(ScriptRuntime.java:1244)
at
org.mozilla.javascript.ScriptableObject.callMethod(ScriptableObject.java:1591)
at
org.apache.cocoon.components.flow.javascript.fom.FOM_JavaScriptInterpreter.handleContinuation(FOM_JavaScriptInterpreter.java:812)
- locked <0x66005880> (a
org.apache.cocoon.components.flow.javascript.fom.FOM_JavaScriptInterpreter$ThreadScope)
at
org.apache.cocoon.components.treeprocessor.sitemap.CallFunctionNode.invoke(CallFunctionNode.java:123)
We have a way to get around this problem by replacing the flowscript
calculators with CGIs for the time being. However, we will want to do
something about this problem in the future. One difficulty in debugging
this problem though is that we have no idea which calculators are
running or where they are at the time of the failure because interpreted
javascript doesn't show up in the stack trace. As a consequence I will
probably recommend that they be rewritten as JSR-168 portlets instead of
using flow - unless someone has a better idea.
Thoughts and comments are welcome.
Ralph