On Feb 11, 2008, at 11:00 PM, Daniel López wrote:

> Hi,
> That would be me reporting :). Yeah, the different port in the end was a
> misunderstanding, but had nothing to do with the bug itself. I have been
> unable/unwilling to try to reproduce it, as the machine where it
> happens is the production server itself, so I cannot run my tests there.

Ok.  I'd certainly not recommend reproducing this in production :)

> The situation when it would happen goes like this: A database has gone
> wild and causes some web apps in a JVM to be stuck. I try to stop the
> Resin instance but it doesn't stop, as I can see the O.S. process still
> lying there. I kill it hard with kill -9 and the O.S. process is gone. After
> that, I try to start the instance and the watchdog process refuses
> saying that 'server X is already running'. After that I am unable to
> start the server unless I kill all the instances and the watchdog and
> start them all over again.

Thanks.  I've added that description to the bug.

> If the watchdog process communicates with the instances through a TCP
> port and it believes an instance is running... does the watchdog process
> communicate with the instances to see if they are running?

The OS tells the watchdog when the child dies.  Since it's the parent  
process of the Resin instance, it has some extra, direct  
information.   So it's unrelated to the socket.
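As a rough illustration of that parent/child relationship (a generic Python sketch, not Resin's actual watchdog code; the sleeping child and the helper name kill_and_reap are made up for the example), a parent learns about its child's death from the OS when it waits on the process, with no socket involved:

```python
import subprocess
import sys

def kill_and_reap(child):
    """Hard-kill a child process and collect its exit status from the OS."""
    child.kill()          # SIGKILL on POSIX, comparable to `kill -9`
    return child.wait()   # blocks until the OS reports the child's exit

# Stand-in for a server instance: a child that would otherwise run for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

still_running = child.poll() is None   # no exit status reported yet
status = kill_and_reap(child)          # parent is told directly that the child is gone
```

Note that a frozen child looks the same as a healthy one here: poll() keeps returning None until the process actually exits.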

Currently, the watchdog doesn't know if the child has frozen, since  
that's not an OS state.  From the point of view of the watchdog, a  
frozen child is still alive.  (Eventually, we'll add some kind of ping  

The specific problem seems to be a state-management one.
   1) If you kill a child, the watchdog will detect that and restart a
      new instance.
   2) But, if the watchdog's stop fails, it's possible the state
      becomes something unmanageable.  (I don't see how that's happening,
      but that matches the symptoms.)
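To make the two cases concrete, here is a small sketch of the kind of per-server state a watchdog might track. It is purely hypothetical (the State and ServerSlot names are invented, and this is not how Resin's watchdog is implemented); it only shows one way to keep the state recoverable, by treating the OS's child-exit notification as authoritative even when a stop is still pending:

```python
from enum import Enum

class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    STOPPING = "stopping"

class ServerSlot:
    """Hypothetical record a watchdog keeps for one server id."""

    def __init__(self, server_id):
        self.server_id = server_id
        self.state = State.STOPPED

    def start(self):
        if self.state is not State.STOPPED:
            raise RuntimeError(f"server {self.server_id} is already running")
        self.state = State.RUNNING

    def request_stop(self):
        # The stop may hang if the child is frozen; we only record the intent.
        if self.state is State.RUNNING:
            self.state = State.STOPPING

    def on_child_exit(self):
        """Called when the OS reports that the child process has died."""
        restart = self.state is State.RUNNING   # case 1: unexpected death
        self.state = State.STOPPED              # the process is gone either way
        if restart:
            self.start()
        return restart

# Case 2 from the list: a stop that never completes, followed by kill -9.
slot = ServerSlot("X")
slot.start()
slot.request_stop()                # stop hangs, state is STOPPING
restarted = slot.on_child_exit()   # kill -9: the OS reports the exit
slot.start()                       # succeeds, because the slot was reset
```

If on_child_exit left the slot in STOPPING, the final start() would raise "already running", which is the symptom described in the report.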

> Or does it simply check that the socket is "still up"? Killing the process
> might have left the socket in an indeterminate state for a while, until the
> O.S. cleans it up, so in that case it might explain why the watchdog
> process could believe the instance is still up.
> Knut, you seem to have a setup quite similar to ours, are you using 3.1
> already or staying with 3.0? If you are using 3.1, are you experiencing
> similar issues or is it just me? ;).
> On the security front, we have one instance where we use a
> security.policy file to sandbox each application in its own environment,
> but it just has two applications with very low usage, and that's why we
> have not had performance problems, even though I have not taken the time
> to see how much it interferes. That is a 2.1 instance though, so the
> policy file would probably need some tweaks to get it up to 3.0.
> S!
> D.
> Scott Ferguson wrote:
>> On Feb 11, 2008, at 2:54 PM, Knut Forkalsrud wrote:
>>> Scott Ferguson wrote:
>>>> There have been some changes to the watchdog.  I haven't been able
>>>> to duplicate the exact failure scenarios yet, so it would be very
>>>> helpful if someone who can reproduce the failures can give me a
>>>> simple sequence to test.
>>> I believe the issues in http://bugs.caucho.com/view.php?id=2409 and
>>> http://bugs.caucho.com/view.php?id=2410 turned out to be a
>>> misunderstanding about the difference between the watchdog port and
>>> the cluster port.  At least that is what I read out of message id
>>> (attached).
>> That was #2409, which I closed.
>> I thought #2410 was different.  If the watchdog-port is the same for
>> several instances (or unset) it was possible for a server to get stuck,
>> so you couldn't start or stop it.  Since that would be an annoying
>> situation, I'd like to track that down and fix it.
>> -- Scott
> _______________________________________________
> resin-interest mailing list
> resin-interest@caucho.com
> http://maillist.caucho.com/mailman/listinfo/resin-interest
