That would be me reporting :). Yeah, the different port in the end was a 
misunderstanding, but it had nothing to do with the bug itself. I have been 
unable/unwilling to try to reproduce it, as the machine where it 
happens is the production server itself, so I cannot run my tests there.

The situation in which it happens goes like this: a database has gone 
wild and causes some web apps in a JVM to get stuck. I try to stop the 
Resin instance but it doesn't stop, as I can see the O.S. process still 
lying there. I kill it hard with kill -9 and the O.S. process is gone. After 
that, I try to start the instance and the watchdog process refuses, 
saying that 'server X is already running'. From then on I am unable to 
start the server unless I kill all the instances and the watchdog and 
start them all over again.

If the watchdog process communicates with the instances through a TCP 
port and it believes an instance is running... does the watchdog process 
communicate with the instances to see if they are running? Or does it 
simply check that the socket is "still up"? Killing the process might 
have left the socket in an indeterminate state for a while, until the 
O.S. cleans it up, so in that case it might explain why the watchdog 
process could believe the instance is still up.
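
One way to tell those two cases apart from the outside, next time it happens, would be to probe the port right after the kill -9 and see whether anything still accepts connections on it. This is only a rough sketch: the function name, host and port are placeholders I made up, and it says nothing about how the watchdog itself actually decides that a server is running.

```python
import socket


def port_is_accepting(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    Hypothetical helper for diagnosis only; it is not part of Resin.
    """
    try:
        # create_connection performs the full TCP handshake, so a True
        # result means some process is still listening on that port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Placeholder values: substitute the instance's real host and port.
    print(port_is_accepting("127.0.0.1", 6600))
```

If this reports False while the watchdog still says 'server X is already running', the stale state is more likely inside the watchdog than in the socket.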

Knut, you seem to have a setup quite similar to ours. Are you using 3.1 
already or staying with 3.0? If you are using 3.1, are you experiencing 
similar issues or is it just me? ;).

On the security front, we have one instance where we use a 
security.policy file to sandbox each application in its own environment, 
but it just has two applications with very low usage, and that's why we 
have not had performance problems, even though I have not taken the time 
to see how much it interferes. That is a 2.1 instance though, so the 
policy file would probably need some tweaks to get it up to 3.0.


Scott Ferguson wrote:
> On Feb 11, 2008, at 2:54 PM, Knut Forkalsrud wrote:
>> Scott Ferguson wrote:
>>> There have been some changes to the watchdog. I haven't been able 
>>> to duplicate the exact failure scenarios yet, so it would be very 
>>> helpful if someone who can reproduce the failures can give me a 
>>> simple sequence to test.
>> I believe the issues in http://bugs.caucho.com/view.php?id=2409 and 
>> http://bugs.caucho.com/view.php?id=2410 turned out to be a 
>> misunderstanding about the difference between the watchdog port and 
>> the cluster port.  At least that is what I read out of message id 
>> <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> (attached).
> That was #2409, which I closed.
> I thought #2410 was different.  If the watchdog-port is the same for 
> several instances (or unset) it was possible for a server to get stuck, 
> so you couldn't start or stop it.  Since that would be an annoying 
> situation, I'd like to track that down and fix it.
> -- Scott

resin-interest mailing list