The attached program is about as small as I can make a test app that
reproduces the problem my server application is having. I have posted
about it repeatedly with no results, probably because nobody can (or wants
to <g>) reproduce it. This little test program is only about 160 lines long,
including comments. It just tries to keep a bunch of transient threads going
at once (the threads don't do anything; they just exit after sleeping for a
millisecond).
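To give an idea of the shape of the test without opening the attachment, here is a rough sketch of that kind of thread churn. It is not the code from tst.cpp. It assumes POSIX threads, the names SketchCounter, SketchThreadMain and RunTransientThreads are made up for illustration, it runs the threads in joined batches rather than keeping a steady number alive, and it leaves out the OpenSSL call that the real thread body makes.

```cpp
#include <pthread.h>
#include <unistd.h>
#include <cassert>

// Shared state for one run: how many threads ran to completion.
// (Hypothetical helper, not taken from tst.cpp.)
struct SketchCounter
{
    pthread_mutex_t lock;
    int iFinished;
};

// Stand-in thread body: sleep about a millisecond, record completion, exit.
static void* SketchThreadMain(void* pArg)
{
    SketchCounter* pCounter = static_cast<SketchCounter*>(pArg);
    usleep(1000);
    pthread_mutex_lock(&pCounter->lock);
    ++pCounter->iFinished;
    pthread_mutex_unlock(&pCounter->lock);
    return 0;
}

// Starts iTotal short-lived threads, at most iBatch at a time, and joins
// each batch before starting the next. Returns how many threads completed.
static int RunTransientThreads(int iTotal, int iBatch)
{
    const int kMaxBatch = 64;
    assert(iBatch > 0 && iBatch <= kMaxBatch);

    SketchCounter counter;
    pthread_mutex_init(&counter.lock, 0);
    counter.iFinished = 0;

    pthread_t aThreads[kMaxBatch];
    int iStarted = 0;
    while (iStarted < iTotal)
    {
        int n = 0;
        while (n < iBatch && iStarted + n < iTotal)
        {
            if (pthread_create(&aThreads[n], 0, SketchThreadMain, &counter) != 0)
                break;  // thread creation failed; join what we have
            ++n;
        }
        for (int i = 0; i < n; ++i)
            pthread_join(aThreads[i], 0);
        if (n == 0)
            break;      // could not start any thread, give up
        iStarted += n;
    }

    pthread_mutex_destroy(&counter.lock);
    return counter.iFinished;
}
```

Calling RunTransientThreads(100, 16) would start 100 threads in batches of at most 16 and return the number that finished.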
<<comp>> <<link>> <<tst.cpp>>
This problem happens on SPARC Solaris. The test program demonstrates it
very quickly (usually within a minute) on both a SPARC Ultra-2 running
Solaris 2.6 and a SPARC Ultra-60 running Solaris 2.8. My "real" app doesn't
crash nearly this fast, because it doesn't stress OpenSSL nearly as hard as
the test app does, but it most certainly crashes every time I test it;
it just takes hours instead of seconds.
Can anyone reproduce this and fix it? I'm in a VERY bad spot here because I
can't ship my product until I get OpenSSL to work. My company pretty much
threw sand in RSA's face in favor of using OpenSSL, on my recommendation,
and now I can't make OpenSSL work and we can't ship my product. This is
hardly a great career move for me. If anyone can identify and fix this bug,
I would greatly appreciate it. I look pretty stupid right now to the folks
in upper management, and I feel like my hands are tied. I'm trying to use
Purify to determine the problem, but I've never used it before and will
probably be slow to figure out how to make it work and understand exactly
what it's telling me.
If anyone sees any obvious misuse problem, PLEASE let me know. I would LOVE
to hear "you're doing it wrong - you forgot to make this function call!" and
be done with it, but as far as I can tell, I'm obeying the OpenSSL usage
laws to the letter.
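For anyone checking the usage: the rule that matters most for a multithreaded program with this generation of OpenSSL is registering a locking callback and a thread-ID callback before any thread touches the library. The following is a minimal sketch of that setup, assuming POSIX threads. It is not copied from tst.cpp, the Sketch* names and g_aLocks are hypothetical, and the OpenSSL registration calls appear only in a comment so that the fragment compiles without the OpenSSL headers.

```cpp
#include <pthread.h>
#include <cassert>
#include <vector>

// Mirrors the value of OpenSSL's CRYPTO_LOCK mode bit so this sketch builds
// without OpenSSL installed. Real code should use CRYPTO_LOCK from
// <openssl/crypto.h> instead of this constant.
static const int kSketchCryptoLock = 1;

// Hypothetical lock table: one mutex per OpenSSL lock slot.
static std::vector<pthread_mutex_t> g_aLocks;

// Allocates and initializes the mutexes. In a real program nLocks would be
// the value returned by CRYPTO_num_locks().
static void SketchInitLocks(int nLocks)
{
    assert(nLocks > 0);
    g_aLocks.resize(nLocks);
    for (int i = 0; i < nLocks; ++i)
        pthread_mutex_init(&g_aLocks[i], 0);
}

// Same signature as the function passed to CRYPTO_set_locking_callback():
// lock slot n when the CRYPTO_LOCK bit is set in mode, otherwise unlock it.
static void SketchLockingCallback(int mode, int n, const char* /*file*/, int /*line*/)
{
    if (mode & kSketchCryptoLock)
        pthread_mutex_lock(&g_aLocks[n]);
    else
        pthread_mutex_unlock(&g_aLocks[n]);
}

// Same signature as the function passed to CRYPTO_set_id_callback().
static unsigned long SketchThreadId()
{
    return (unsigned long)pthread_self();
}

// Registration, done once before any thread is started:
//   SketchInitLocks(CRYPTO_num_locks());
//   CRYPTO_set_id_callback(SketchThreadId);
//   CRYPTO_set_locking_callback(SketchLockingCallback);
```

If either callback is missing, shared OpenSSL structures such as the per-thread error-state table are modified without any locking.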
If you run the "comp" and "link" scripts to build this little test program
and then run the resulting "tst" executable, it should crash after a short
time. If you then run dbx against the resulting core file, you should get
the following stack in response to the dbx "where" command:
core file header read successfully
Reading ld.so.1
Reading libsocket.so.1
Reading libCrun.so.1
Reading libm.so.1
Reading libw.so.1
Reading libthread.so.1
Reading libc.so.1
Reading libnsl.so.1
Reading libdl.so.1
Reading libmp.so.2
Reading libc_psr.so.1
detected a multithreaded program
t@3937 (l@48) terminated by signal BUS (invalid address alignment)
Current function is ThreadMain
100 int iErr = ERR_get_error ();
(/opt/SUNWspro/bin/../WS6/bin/sparcv9/dbx) where
current thread: t@3937
  [1] t_delete(0x9, 0xff2b6000, 0x150, 0x65300, 0x651a8, 0x150), at 0xff241798
  [2] realfree(0x9, 0xff2bc7b0, 0xff2b6000, 0x65300, 0x153, 0x65308), at 0xff241420
  [3] cleanfree(0x0, 0xff2b6000, 0xff2bc724, 0xff2bc7a4, 0xff2bc730, 0x0), at 0xff241cb4
  [4] _malloc_unlocked(0x60, 0x0, 0xff2b6000, 0x60, 0x5, 0x0), at 0xff240e20
  [5] malloc(0x60, 0x60, 0x62798, 0x150, 0x0, 0x0), at 0xff240d3c
  [6] CRYPTO_malloc(0x5a5b0, 0x470d0, 0x77, 0x5a400, 0x470d0, 0x60), at 0x17070
  [7] lh_new(0x1cba0, 0x1cbb8, 0x470d0, 0x2be, 0x1cbb8, 0x14c), at 0x34604
  [8] ERR_get_state(0x5a400, 0x0, 0x673e0, 0x430d8, 0x673e0, 0xf7509b28), at 0x1ce6c
  [9] get_error_values(0x1, 0x0, 0x0, 0x0, 0x0, 0x0), at 0x1c4a0
=>[10] ThreadMain(pNothing = (nil)), line 100 in "tst.cpp"
(/opt/SUNWspro/bin/../WS6/bin/sparcv9/dbx) quit
Thanks for your help,
Bill Rebey