Hiya

I've been rebuilding our test systems with the new 4.5.0 release, and happily have had no issues with the upgrade (other than a couple of small build issues). One problem has come up, but it's taking advantage of a new feature. Some background first...

We have a "timeout" module that spawns a monitor thread and watches connections to ensure they don't exceed either a default timeout or one they set themselves. At the moment all we do is fire a signal handler which sets a global variable. We then check this variable in certain bits of code (primarily inside a customised nsoracle module which uses non-blocking mode, as this is typically where our scripts get held up) and raise errors when and where we can. However, using the new [ns_ictl cancel] command we can cover a lot more cases and, for example, break out of looping TCL code (caused by programmer error, or by an expected condition never being met).
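
To make the current mechanism concrete, here is a minimal sketch of the flag-polling pattern described above. It is not the real module: the names timedOut, TimeoutHandler and DoWork, and the choice of SIGUSR1, are assumptions made purely for illustration.

```c
#include <signal.h>

/* Hypothetical names throughout; the real timeout module is not shown. */

/* Set by the signal handler, polled by long-running code. */
volatile sig_atomic_t timedOut = 0;

/* Signal handler: does nothing except set the flag, which keeps it
 * async-signal-safe. */
void
TimeoutHandler(int signo)
{
    (void) signo;
    timedOut = 1;
}

/* Runs up to maxSteps units of work, checking the flag before each one.
 * Returns the number of steps actually completed. */
int
DoWork(int maxSteps)
{
    int done = 0;

    while (done < maxSteps && !timedOut) {
        done++;             /* stand-in for one non-blocking operation */
    }
    return done;
}
```

The limitation is visible in DoWork(): only code that has been written to poll the flag can be interrupted, which is why a looping script is out of reach with this approach.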

To implement this I've just extracted the lines called by that sub command from NsTclICtlObjCmd() in nsd/tclinit.c into a standalone function:

/* Implements ns_ictl cancel  */
int
Ns_ICtlCancel(int threadid)
{
        Tcl_HashEntry *hPtr;
        TclData *dataPtr;

        Ns_MutexLock(&tlock);
        hPtr = Tcl_FindHashEntry(&threads, (char *) threadid);
        if (hPtr != NULL) {
            dataPtr = Tcl_GetHashValue(hPtr);
            Tcl_AsyncMark(dataPtr->cancel);
        }
        Ns_MutexUnlock(&tlock);
        if (hPtr == NULL) {
                return NS_ERROR;
        }
        return NS_OK;
}

I then simply call the above from inside our timeout module (we already have the thread id available, so it's a one line addition).

When I was testing this yesterday it was all working precisely as expected. Fantastic! However, overnight both of the servers I installed this on fell over - seemingly the first time the code was invoked after midnight. I'll try to replicate this later. However, looking at the core dumps:

#0  0x00a897a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00ac97d5 in raise () from /lib/tls/libc.so.6
#2  0x00acb149 in abort () from /lib/tls/libc.so.6
#3  0x001cbe3a in Abort (signal=11) at unix.c:365
#4  <signal handler called>
#5  0x00766db0 in ResetObjResult (iPtr=0x0) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclResult.c:824
#6  0x00766d30 in Tcl_ResetResult (interp=0x0) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclResult.c:787
#7  0x001b8ac1 in AsyncCancel (ignored=0x0, interp=0x0, code=0) at tclinit.c:2086
#8  0x006f9cc2 in Tcl_AsyncInvoke (interp=0x0, code=0) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclAsync.c:256
#9  0x00757e53 in Tcl_ServiceEvent (flags=-3) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclNotify.c:590
#10 0x00758305 in Tcl_DoOneEvent (flags=-3) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclNotify.c:945
#11 0x0072992f in Tcl_VwaitObjCmd (clientData=0x0, interp=0x9aabc20, objc=2, objv=0x970a390) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclEvent.c:1101
#12 0x006fc93a in TclEvalObjvInternal (interp=0x9aabc20, objc=2, objv=0x970a390, command=0x0, length=0, flags=0) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclBasic.c:3087

It's clear where the problem is - AsyncCancel has been passed NULL where it's expecting a valid interpreter pointer. So I've added a quick check to that for the time being, to simply log such events rather than crash the server. However, it's got a NULL because one was passed to Tcl_AsyncInvoke - in this case that looks pretty intentional, tclNotify.c:590 is:
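
The quick check amounts to something like the following. This is a sketch rather than the actual patch: FakeInterp and the FAKE_TCL_* constants are stand-ins so that it compiles without tcl.h, and skippedCancels is a hypothetical counter. In tclinit.c the same guard would sit in front of the existing Tcl_ResetResult() call.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-ins (assumptions) so the sketch builds on its own. The real
 * handler takes a Tcl_Interp * and returns TCL_ERROR. */
typedef struct FakeInterp {
    const char *result;
} FakeInterp;

enum { FAKE_TCL_OK = 0, FAKE_TCL_ERROR = 1 };

/* Hypothetical counter of cancels that arrived without an interpreter. */
int skippedCancels = 0;

/* Async handler with a NULL guard: when there is no interpreter, log
 * the event and return the code unchanged instead of dereferencing
 * the NULL pointer. */
int
AsyncCancelGuarded(void *ignored, FakeInterp *interp, int code)
{
    (void) ignored;

    if (interp == NULL) {
        skippedCancels++;
        fprintf(stderr, "AsyncCancel: called without an interp, skipping\n");
        return code;
    }
    /* Normal path: leave an error message and force an error return. */
    interp->result = "async cancel";
    return FAKE_TCL_ERROR;
}
```

With the guard in place a NULL call is logged and otherwise ignored, so the server survives but that particular invocation cancels nothing.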

        (void) Tcl_AsyncInvoke((Tcl_Interp *) NULL, 0);

and the comments in Tcl_AsyncInvoke() make it clear it expects to sometimes be called with a NULL.

Now I don't know all the details of why Tcl_AsyncInvoke is invoked with a NULL, and what we want to do in that situation... so I thought I'd ask here before trying to follow large amounts of the TCL and AOLserver source through by hand. :)

Clearly if that is valid behaviour by TCL, then AsyncCancel must be modified. Is it OK for us to look up our interpreter (Ns_TclGetConn + Ns_GetConnInterp)? If not... The return code is being ignored, so how can we cause an error - or will we be invoked again (this time with an interp), at which point we can do our job of provoking an error and so script cancellation?

TIA

PS: If other people are potentially interested in this module, then please let me know. It would also be good to know whether there would be any objections to modifying the core slightly (extending the conn struct and adding some API functions), so that the module can be better integrated. At present it requires that you (as an interpreter) add and remove yourself at the start and end of a connection. This is fine for us as we have a ns_register_proc'd TCL wrapper on all our requests anyway.

--
Stuart Children
http://terminus.co.uk/

