Hi Dossy,
I have tried setting stacksize to 512k / 1MB, and I still receive the
realloc error. I've also disabled memory overcommit (by setting
vm.overcommit_memory to 2) and it didn't help, unfortunately. For
reference, here were the stats:
$ grep Commit /proc/meminfo
CommitLimit: 2081908 kB
Committed_AS: 326728 kB
$ sysctl vm.overcommit_{memory,ratio}
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
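A rough sketch for reading the gap between those two Commit figures.
commit_headroom is a hypothetical helper name of my own, not part of
AOLserver or any standard tool. The gap only matters when
vm.overcommit_memory is 2, since that is the mode in which the kernel
enforces CommitLimit:

```shell
# commit_headroom: hypothetical helper, prints CommitLimit - Committed_AS in kB.
# Reads /proc/meminfo by default, or a meminfo-style file given as $1.
commit_headroom() {
    awk '/^CommitLimit:/  { limit = $2 }
         /^Committed_AS:/ { used = $2 }
         END              { print limit - used }' "${1:-/proc/meminfo}"
}
```

With the figures above this prints 1755180 (kB), i.e. roughly 1.7GB of
headroom.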
>How much swap do you have configured?
I have 512MB of swap configured, and its usage seems to vary between 0 kB
and 150 kB. I think 512MB is sufficient, but I can increase it if necessary.
After some experimentation last night, I was able to get the memory
usage past 1G by running some simple scripts like this in parallel:
set a "1" ; for { set i 0 } { $i < 300000000 } { incr i } { append a "11111111111111" }
The usage climbed to:
12199 nsadmin 17 0 1591m 1.0g 2984 S 200 34.8 3:37.78 nsd
and it seemed to be stable while running these scripts.
I am beginning to think it's less of a memory issue and more of a
"something's not thread safe" issue (even though I've removed all my
custom modules). I can't reproduce this problem with simulations, but as
soon as I switch my real users to the new server, it begins restarting
within a couple of minutes. The fact that it restarts at around
90-110MB could just be a coincidence.
But I have no idea how to find the culprit. When I switched from
3.3+ad13 to 4.0.10, my server crashed all the time, and I debugged nsd
and found that ns_server was at fault. So I removed references to it,
and everything worked fine.
Now it's crashing in the Tcl interpreter, and I have no idea how to get
the underlying Tcl code from that (and even if I did, it's probably not
related to the actual cause of the issue, which is more likely heap
corruption from a different thread).
Since all things are otherwise equal between my FC4 and FC5 box, that
might indicate a glibc issue or something similar. But I've found it
difficult to link against a different glibc (since your
binutils/gcc/kernel have to match closely, and upgrading glibc can break
the rest of your system).
Some more info, if it helps: My server receives a medium-high amount of
traffic, about 4 million requests/day, sometimes as high as 200
requests/second.
I don't use ns_server anymore, but I do use [ns_info pageroot], the
"source" command, and exec (to run ImageMagick, Expect scripts, and other
things - probably 1-2 execs/second). I get/set something in ns_cache on
almost every request, although the caches themselves are relatively
small (200-300k), and I believe I was able to reproduce the problem even
with ns_cache disabled. I use PostgreSQL over TCP/IP, with 3 pools and 5
connections per pool (I tried Marc's maxidle/open suggestion with no
success). And I don't use ADP at all - only Tcl and static content.
Thanks for all your help so far.
-Hossein
--
AOLserver - http://www.aolserver.com/