Hiya
I've been rebuilding our test systems with the new 4.5.0 release, and
happily have had no issues with the upgrade (other than a couple of
small build issues). One problem has come up, but only because I'm taking
advantage of a new feature. Some background first...
We have a "timeout" module that spawns a monitor thread and watches
connections to ensure they don't exceed either a default timeout or one
they set themselves. At the moment all we do is fire a signal handler
which sets a global variable. We then monitor this in certain bits of
code (primarily inside a customised nsoracle module which is using
non-blocking mode, as this is typically where our scripts get held up),
and raise errors when and where we can. However, using the new [ns_ictl
cancel] command we can actually cover a lot more cases and, for example,
break out of looping TCL code (caused by programmer error, or by a
condition unexpectedly failing to be met).
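For anyone unfamiliar with the existing approach, here is a minimal sketch of the "signal handler sets a flag, code polls it" pattern described above. It is an illustration only, not the actual module: the names timed_out, on_timeout and run_steps are hypothetical, and the loop body stands in for whatever real work gets checked between steps.

```c
#include <signal.h>

/* Hypothetical flag; sig_atomic_t so it is safe to set from a handler. */
static volatile sig_atomic_t timed_out = 0;

/* Hypothetical signal handler: does nothing except record the timeout. */
static void on_timeout(int signo)
{
    (void) signo;
    timed_out = 1;
}

/*
 * Hypothetical worker: performs up to max_steps units of work, checking
 * the flag before each one. Returns the number of steps completed, so a
 * short count tells the caller that the work was cut off.
 */
static int run_steps(int max_steps)
{
    int done = 0;

    while (done < max_steps && !timed_out) {
        done++; /* placeholder for one unit of real work */
    }
    return done;
}
```

The limitation is visible in the sketch: only code that explicitly checks the flag can be interrupted, which is why a script stuck in a loop that never reaches such a check is not covered.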
To implement this I've just extracted the lines called by that
subcommand from NsTclICtlObjCmd() in nsd/tclinit.c into a standalone function:
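/* Implements ns_ictl cancel */
int
Ns_ICtlCancel(int threadid)
{
    Tcl_HashEntry *hPtr;
    TclData *dataPtr;

    /* Look up the per-thread data while holding the lock. */
    Ns_MutexLock(&tlock);
    hPtr = Tcl_FindHashEntry(&threads, (char *) threadid);
    if (hPtr != NULL) {
        /* Mark that thread's async cancel handler as ready to run. */
        dataPtr = Tcl_GetHashValue(hPtr);
        Tcl_AsyncMark(dataPtr->cancel);
    }
    Ns_MutexUnlock(&tlock);

    /* No entry for this thread id: nothing was cancelled. */
    if (hPtr == NULL) {
        return NS_ERROR;
    }
    return NS_OK;
}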
I then simply call the above from inside our timeout module (we already
have the thread id available, so it's a one line addition).
When I was testing this yesterday it was all working precisely as
expected. Fantastic! However, overnight both of the servers I installed
this on fell over - seemingly the first time the code was invoked after
midnight. I'll try to replicate this later. However, looking at the core
dumps:
#0 0x00a897a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00ac97d5 in raise () from /lib/tls/libc.so.6
#2 0x00acb149 in abort () from /lib/tls/libc.so.6
#3 0x001cbe3a in Abort (signal=11) at unix.c:365
#4 <signal handler called>
#5 0x00766db0 in ResetObjResult (iPtr=0x0) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclResult.c:824
#6 0x00766d30 in Tcl_ResetResult (interp=0x0) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclResult.c:787
#7 0x001b8ac1 in AsyncCancel (ignored=0x0, interp=0x0, code=0) at tclinit.c:2086
#8 0x006f9cc2 in Tcl_AsyncInvoke (interp=0x0, code=0) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclAsync.c:256
#9 0x00757e53 in Tcl_ServiceEvent (flags=-3) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclNotify.c:590
#10 0x00758305 in Tcl_DoOneEvent (flags=-3) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclNotify.c:945
#11 0x0072992f in Tcl_VwaitObjCmd (clientData=0x0, interp=0x9aabc20, objc=2, objv=0x970a390) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclEvent.c:1101
#12 0x006fc93a in TclEvalObjvInternal (interp=0x9aabc20, objc=2, objv=0x970a390, command=0x0, length=0, flags=0) at /tmp/stuartc_builddir/aolserver/tcl8.4.13/unix/../generic/tclBasic.c:3087
It's clear where the problem is - AsyncCancel has been passed NULL where
it's expecting a valid interpreter pointer. So I've added a quick check
to that for the time being, to simply log such events rather than crash
the server. However, it's got a NULL because one was passed to
Tcl_AsyncInvoke - in this case that looks pretty intentional,
tclNotify.c:590 is:
(void) Tcl_AsyncInvoke((Tcl_Interp *) NULL, 0);
and the comments in Tcl_AsyncInvoke() make it clear it expects to
sometimes be called with a NULL.
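To make the interim check concrete, here is a rough standalone model of it. This is not the real AsyncCancel from tclinit.c: ModelInterp, model_async_cancel, skipped_cancels and the MODEL_* codes are all hypothetical stand-ins so that the sketch compiles without the Tcl headers, and the result string is just a placeholder. The real handler takes (ClientData, Tcl_Interp *, int), as the backtrace shows.

```c
#include <stddef.h>

#define MODEL_OK    0 /* stand-in for TCL_OK */
#define MODEL_ERROR 1 /* stand-in for TCL_ERROR */

/* Hypothetical stand-in for Tcl_Interp; only holds a result message. */
typedef struct ModelInterp {
    const char *result;
} ModelInterp;

/* Counts handler invocations that arrived without an interpreter. */
static int skipped_cancels = 0;

/*
 * Model of an async handler with a NULL guard. With no interpreter there
 * is nothing to reset or flag, so the event is only recorded (the real
 * code would log it) and the incoming code is passed straight through.
 */
static int model_async_cancel(void *ignored, ModelInterp *interp, int code)
{
    (void) ignored;

    if (interp == NULL) {
        skipped_cancels++;
        return code;
    }
    interp->result = "cancelled"; /* placeholder message */
    return MODEL_ERROR;
}
```

Note that the guard only avoids the crash. In the NULL case the script is not actually cancelled, which is the gap the questions below are about.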
Now I don't know all the details of why Tcl_AsyncInvoke is invoked with
a NULL, and what we want to do in that situation... so I thought I'd ask
here before trying to follow large amounts of the TCL and AOLserver
source through by hand. :)
Clearly if that is valid behaviour by TCL, then AsyncCancel must be
modified. Is it OK for us to look up our interpreter (Ns_TclGetConn +
Ns_GetConnInterp)? If not, the return code is being ignored in this case,
so how can we cause an error? Or will we be invoked again (this time with
an interp), at which point we can do our job of provoking an error and so
cancelling the script?
TIA
PS: If other people are potentially interested in this module, then
please let me know. It would also be good to know whether there would be
any objections to modifying the core slightly (extending the conn struct
and adding some API functions), so that the module can be better
integrated. At present it requires that you (as an interpreter) add and
remove yourself at the start and end of a connection. This is fine for
us as we have a ns_register_proc'd TCL wrapper on all our requests anyway.
--
Stuart Children
http://terminus.co.uk/