That would be me reporting :). Yeah, the different port in the end was a
misunderstanding, but had nothing to do with the bug itself. I have been
unable/unwilling to try to reproduce it, as the machine where it
happens is the production server itself, so I cannot do my tests there.
The situation where it happens goes like this: a database has gone
wild and causes some web apps in a JVM to get stuck. I try to stop the
Resin instance but it doesn't stop, as I can see the O.S. process still
lying there. I kill it hard with kill -9 and the O.S. process is gone. After
that, I try to start the instance and the watchdog process refuses
saying that 'server X is already running'. After that I am unable to
start the server unless I kill all the instances and the watchdog and
start them all over again.
If the watchdog process communicates with the instances through a TCP
port and it believes an instance is running... does the watchdog process
communicate with the instances to see if they are running? Or does it
simply check that the socket is "still up"? Killing the process might
have left the socket in an indeterminate state for a while, until the
O.S. cleans it up, so in that case it might explain why the watchdog
process could believe the instance is still up.
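To tell the two cases apart from the outside, one could probe the port
by hand right after the kill -9. The following is only a minimal sketch
of such a probe; it is not how the watchdog does its check (I don't know
that), and the class name PortProbe, the method isAccepting and the
host/port arguments are made up for illustration. It only reports
whether something currently accepts a TCP connection on the given port.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Resin: checks whether any process
// currently accepts TCP connections on host:port.
class PortProbe {

    // Returns true if a TCP connection to host:port succeeds within
    // timeoutMs milliseconds, false on refusal, timeout or other I/O error.
    static boolean isAccepting(String host, int port, int timeoutMs) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: java PortProbe <host> <port>");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        System.out.println(host + ":" + port + " accepting connections: "
                + isAccepting(host, port, 1000));
    }
}
```

If the probe reports false for the instance's port while the watchdog
still says 'server X is already running', that would suggest the
watchdog relies on its own bookkeeping rather than on a live connection.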
Knut, you seem to have a setup quite similar to ours, are you using 3.1
already or staying on 3.0? If you are using 3.1, are you experiencing
similar issues or is it just me? ;).
On the security front, we have one instance where we use a
security.policy file to sandbox each application in its own environment,
but it just has two applications with very low usage, and that's why we
have not had performance problems, even though I have not taken the time
to see how much it interferes. That is a 2.1 instance though, so the
policy file would probably need some tweaks to get it up to 3.0.
Scott Ferguson wrote:
> On Feb 11, 2008, at 2:54 PM, Knut Forkalsrud wrote:
>> Scott Ferguson wrote:
>>> There have been some changes to the watchdog. I haven't been able
>>> to duplicate the exact failure scenarios yet, so it would be very
>>> helpful if someone who can reproduce the failures can give me a
>>> simple sequence to test.
>> I believe the issues in http://bugs.caucho.com/view.php?id=2409 and
>> http://bugs.caucho.com/view.php?id=2410 turned out to be a
>> misunderstanding about the difference between the watchdog port and
>> the cluster port. At least that is what I read out of message id
>> <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> (attached).
> That was #2409, which I closed.
> I thought #2410 was different. If the watchdog-port is the same for
> several instances (or unset) it was possible for a server to get stuck,
> so you couldn't start or stop it. Since that would be an annoying
> situation, I'd like to track that down and fix it.
> -- Scott