I have a frustrating segfault problem that I can't seem to track down.

Every once in a while, my AOLserver will segfault in _smalloc (which
is being called from ns_malloc) for no apparent reason.  I can easily
replicate this by subjecting my server to high load, but I do not have
a stand-alone test case (I can't even think of how to go about
constructing one).  Some representative gdb backtraces are below.

What could possibly be causing it to segfault in _smalloc, of all
places?  The memory usage of my process was quite low, about 30 MB or
less, and I've had AOLserver processes go MUCH larger than that on
this box, so I know it wasn't running out of memory.

I'm running AOLserver 3.3+ad12 on Solaris (SunOS 5.8), with my own
custom C module (nsdtkapi.c) for talking to a data feed, plus a
closed-source vendor library in there which my code uses.

I normally use gcc 2.95.2, but I also tried Sun cc (a.k.a. Sun WorkShop
6 update 2 C 5.3 2001/05/15); it seems to make no difference.  Purify
(with both gcc and Sun cc) reports various sorts of "errors", none of
which I can do anything about (in the vendor library), and I'm not at
all convinced they have anything to do with this problem anyway.

The only possibly useful thing I've noticed is that I have a config
switch which, when set, makes heavier use of ns_cond.  When the switch
is on, I seem to get the segfaults more frequently.  (Basically, I'm
getting data coming in, and I have one thread receiving all data.
With the config switch off, that single thread does most of the
processing of the incoming data too.  With the switch on, it instead
sticks the data in an nsv and uses ns_cond to tell another thread,
"Hey, wake up, there's new data for you to process.")  Note that
ns_cond gets used no matter what; it just gets used more with the
switch on.
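
To make the handoff pattern described above concrete, here is a
minimal stand-alone sketch in plain C with POSIX threads.  It is not
the actual nsdtkapi.c code.  The names Handoff, worker, and
run_handoff_demo are hypothetical, a small in-memory ring buffer
stands in for the nsv, and pthread mutexes and condition variables
stand in for ns_mutex and ns_cond.  It only shows the shape of "one
thread receives, another is woken to process".

```c
#include <assert.h>
#include <pthread.h>

#define QUEUE_CAP 16

/* Hypothetical shared state: stands in for the nsv plus its
 * mutex and condition variables. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;   /* "wake up, there's new data" */
    pthread_cond_t  not_full;    /* receiver may enqueue again */
    int  items[QUEUE_CAP];
    int  head;
    int  count;
    int  done;                   /* set when the feed ends */
    long sum;                    /* result of the "processing" */
} Handoff;

/* Processing thread: sleeps on the condition until data arrives. */
static void *worker(void *arg)
{
    Handoff *h = arg;

    pthread_mutex_lock(&h->lock);
    for (;;) {
        /* Re-check the predicate in a loop: waits may wake spuriously. */
        while (h->count == 0 && !h->done)
            pthread_cond_wait(&h->not_empty, &h->lock);
        if (h->count == 0 && h->done)
            break;

        int v = h->items[h->head];
        h->head = (h->head + 1) % QUEUE_CAP;
        h->count--;
        pthread_cond_signal(&h->not_full);

        h->sum += v;             /* done under the lock for simplicity */
    }
    pthread_mutex_unlock(&h->lock);
    return NULL;
}

/* Receiving thread: enqueues n_items values, signalling the worker
 * after each one, then returns the worker's total. */
long run_handoff_demo(int n_items)
{
    Handoff h;
    pthread_t tid;

    pthread_mutex_init(&h.lock, NULL);
    pthread_cond_init(&h.not_empty, NULL);
    pthread_cond_init(&h.not_full, NULL);
    h.head = h.count = h.done = 0;
    h.sum = 0;

    int rc = pthread_create(&tid, NULL, worker, &h);
    assert(rc == 0);

    for (int i = 0; i < n_items; i++) {
        pthread_mutex_lock(&h.lock);
        while (h.count == QUEUE_CAP)
            pthread_cond_wait(&h.not_full, &h.lock);
        h.items[(h.head + h.count) % QUEUE_CAP] = i;
        h.count++;
        pthread_cond_signal(&h.not_empty);
        pthread_mutex_unlock(&h.lock);
    }

    /* Tell the worker that no more data is coming. */
    pthread_mutex_lock(&h.lock);
    h.done = 1;
    pthread_cond_signal(&h.not_empty);
    pthread_mutex_unlock(&h.lock);

    pthread_join(tid, NULL);
    pthread_cond_destroy(&h.not_full);
    pthread_cond_destroy(&h.not_empty);
    pthread_mutex_destroy(&h.lock);
    return h.sum;
}
```

In this sketch every access to the shared buffer happens with the
mutex held, and the waiter re-checks its predicate after waking.
Calling run_handoff_demo(100) enqueues 0 through 99 and returns their
sum.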

The only other experiments I've thought of are to try different
versions of AOLserver and/or Tcl.  Basically, I'm mostly out of ideas
and insight, so if anyone has any, I'd appreciate hearing it.  Thanks!

Backtraces from core files:


Died while my C code was setting an nsv:

(gdb) bt
#0  0xff14124c in _smalloc () from /lib/libc.so.1
#1  0xff141294 in malloc () from /lib/libc.so.1
#2  0x1414b4 in ns_malloc (size=9) at memory.c:126
#3  0x796e8 in UpdateVar (hPtr=0x884070, value=0xfe3fa494 "TODAY_DT")
    at tclvar.c:628
#4  0x7978c in SetVar (arrayPtr=0x174b070,
    key=0xfe3fa6a4 "dtk_field_mnemonic.42", value=0xfe3fa494 "TODAY_DT")
    at tclvar.c:660
#5  0x7813c in NsTclVSetCmd (dummy=0x73, interp=0x658710, argc=4,
    argv=0xfe3fa3f8) at tclvar.c:165
#6  0xfe7d49a4 in DTK_SecurityIdType_GetName (id=6653712) at nsdtkapi.c:1281
#7  0xfe7d5e4c in DTK_GetX_InsertNsv_ForAllFields (interp=0x658710, conId=2,
    dtk_cmd=1578, dtk_request_id=45, n_fields=0, field_ids=0x1c64330,
    field_mnemonic_list=0x0) at nsdtkapi.c:1955
#8  0xfe7d6f28 in DTK_GetX (interp=0x658710, dtk_cmd=2, conId=0, n_fields=45,
    field_mnemonic_list=0x1c64330, n_securities=750, security_list=0xdb2268,
    security_type_list=0x0, reqIdArrPtr=0xfe3fc3ac, requestsPtr=0xfe3fc3a8,
    start_date=0, end_date=0, bit_flags=-2147483645, bar_minutes=0,
    field_code=0) at nsdtkapi.c:2543

#28 0x6e1c8 in EvalScript (arg=0x5d8498) at tclsched.c:562
#29 0x6e334 in NsTclThread (arg=0x5d8498) at tclsched.c:663
#30 0x1402a0 in NsThreadMain (arg=0x5d8c98) at thread.c:228
(gdb)


Died after _ns_cleanupinterp called a bunch of stuff?

(gdb) bt
#0  0xff14124c in _smalloc () from /lib/libc.so.1
#1  0xff141294 in malloc () from /lib/libc.so.1
#2  0x1414b4 in ns_malloc (size=10) at memory.c:126
#3  0x123314 in TclpAlloc (nbytes=10) at ./../generic/nsthreads.c:739
#4  0x897c0 in Tcl_Alloc (size=10) at ./../generic/tclCkalloc.c:846
#5  0x102204 in Tcl_NewStringObj (bytes=0x643e28 "errorCode", length=9)
    at ./../generic/tclStringObj.c:170
#6  0x91684 in InfoGlobalsCmd (dummy=0x0, interp=0x5e7330, objc=2,
    objv=0x6623a0) at ./../generic/tclCmdIL.c:986
#7  0x905c0 in Tcl_InfoObjCmd (clientData=0x0, interp=0x5e7330, objc=2,
    objv=0x6623a0) at ./../generic/tclCmdIL.c:416
#8  0xb6d24 in TclExecuteByteCode (interp=0x5e7330, codePtr=0x995b58)
    at ./../generic/tclExecute.c:845
#9  0x8358c in Tcl_EvalObjEx (interp=0x5e7330, objPtr=0x9514a8, flags=0)
    at ./../generic/tclBasic.c:2733
#10 0xfbbc4 in TclObjInterpProc (clientData=0x64d810, interp=0x5e7330, objc=2,
    objv=0xfe4f88b8) at ./../generic/tclProc.c:1001
#11 0xed9e8 in EvalObjv (interp=0x5e7330, objc=2, objv=0xfe4f88b8,
    command=0xfe4f8b94 "_ns_cleanupinterp 1", length=19, flags=0)
    at ./../generic/tclParse.c:932
#12 0xee578 in Tcl_EvalEx (interp=0x5e7330,
    script=0xfe4f8b94 "_ns_cleanupinterp 1", numBytes=19, flags=262144)
    at ./../generic/tclParse.c:1393
#13 0x23c50 in NsTclEval (interp=0x5e7330,
    script=0xfe4f8b94 "_ns_cleanupinterp 1") at tclstubs.cpp:225
#14 0x64170 in CleanupData (tdPtr=0x5eb180) at tclinit.c:1341
#15 0x62860 in Ns_TclDeAllocateInterp (interp=0x5e7330) at tclinit.c:446
#16 0x692b0 in Ns_TclEval (pds=0xfe4f8dd8, server=0x0,
    script=0xfe4f8ff4 "nsv_exists dtk_conn_open_requests.0 446") at tclop.c:124
#17 0xfe7d4d6c in DTK_NsvExists (
    nsvString=0xfe4fa55c "dtk_conn_open_requests.0", keyString=0xfe4fb1bc "446")
    at nsdtkapi.c:1458

#30 0xee9c4 in Tcl_Eval (interp=0x5e7330,
    string=0x5d8310 "dtk_get_data_and_log 0") at ./../generic/tclParse.c:1512
#31 0x6e1c8 in EvalScript (arg=0x5d8310) at tclsched.c:562
#32 0x6e334 in NsTclThread (arg=0x5d8310) at tclsched.c:663
#33 0x1402a0 in NsThreadMain (arg=0x5df420) at thread.c:228
(gdb)


Died after _ns_cleanupinterp called a bunch of stuff?

(gdb) bt
#0  0xff14124c in _smalloc () from /lib/libc.so.1
#1  0xff141294 in malloc () from /lib/libc.so.1
#2  0x1414b4 in ns_malloc (size=10) at memory.c:126
#3  0x123314 in TclpAlloc (nbytes=10) at ./../generic/nsthreads.c:739
#4  0x897c0 in Tcl_Alloc (size=10) at ./../generic/tclCkalloc.c:846
#5  0x102204 in Tcl_NewStringObj (bytes=0x643e28 "errorCode", length=9)
    at ./../generic/tclStringObj.c:170
#6  0x91684 in InfoGlobalsCmd (dummy=0x0, interp=0x5e7330, objc=2,
    objv=0x6623a0) at ./../generic/tclCmdIL.c:986
#7  0x905c0 in Tcl_InfoObjCmd (clientData=0x0, interp=0x5e7330, objc=2,
    objv=0x6623a0) at ./../generic/tclCmdIL.c:416
#8  0xb6d24 in TclExecuteByteCode (interp=0x5e7330, codePtr=0x995b58)
    at ./../generic/tclExecute.c:845
#9  0x8358c in Tcl_EvalObjEx (interp=0x5e7330, objPtr=0x9514a8, flags=0)
    at ./../generic/tclBasic.c:2733
#10 0xfbbc4 in TclObjInterpProc (clientData=0x64d810, interp=0x5e7330, objc=2,
    objv=0xfe4f88b8) at ./../generic/tclProc.c:1001
#11 0xed9e8 in EvalObjv (interp=0x5e7330, objc=2, objv=0xfe4f88b8,
    command=0xfe4f8b94 "_ns_cleanupinterp 1", length=19, flags=0)
    at ./../generic/tclParse.c:932
#12 0xee578 in Tcl_EvalEx (interp=0x5e7330,
    script=0xfe4f8b94 "_ns_cleanupinterp 1", numBytes=19, flags=262144)
    at ./../generic/tclParse.c:1393
#13 0x23c50 in NsTclEval (interp=0x5e7330,
    script=0xfe4f8b94 "_ns_cleanupinterp 1") at tclstubs.cpp:225
#14 0x64170 in CleanupData (tdPtr=0x5eb180) at tclinit.c:1341
#15 0x62860 in Ns_TclDeAllocateInterp (interp=0x5e7330) at tclinit.c:446
#16 0x692b0 in Ns_TclEval (pds=0xfe4f8dd8, server=0x0,
    script=0xfe4f8ff4 "nsv_exists dtk_conn_open_requests.0 446") at tclop.c:124
#17 0xfe7d4d6c in DTK_NsvExists (
    nsvString=0xfe4fa55c "dtk_conn_open_requests.0", keyString=0xfe4fb1bc "446")
    at nsdtkapi.c:1458
#18 0xfe7d97d4 in DTK_DecodeFields (interp=0x5e7330, svc_code=38,
    structPtr=0xc65268, resultSetPtr=0xd6fad0, conId=0, dtk_response_id=445,
    request_id_ptr=0xfe4fc918, ds_svc_code=0xfe4fc0a0) at nsdtkapi.c:3845
#19 0xfe7d8a10 in DTK_ReceiveData (interp=0x5e7330, conId=0,
    resultSetPtr=0xd6fad0, request_id_ptr=0xfe4fc918, ds_svc_code=0xfe4fc0a0)
    at nsdtkapi.c:3402
#20 0xfe7e2790 in DTK_TclCmd (data=0x0, interp=0x5e7330, argc=4,
    argv=0xfe4fcc18) at nsdtkapi.c:6491

#31 0x6e1c8 in EvalScript (arg=0x5d8310) at tclsched.c:562
#32 0x6e334 in NsTclThread (arg=0x5d8310) at tclsched.c:663
#33 0x1402a0 in NsThreadMain (arg=0x5df420) at thread.c:228
(gdb)

--
Andrew Piskorski <[EMAIL PROTECTED]>
http://www.piskorski.com
