Hi Dossy,

I have tried setting stacksize to 512k and to 1MB, and I still receive the realloc error. I've also disabled memory overcommit (by setting vm.overcommit_memory to 2), and unfortunately it didn't help. For reference, here were the stats:

$ grep Commit /proc/meminfo
CommitLimit:   2081908 kB
Committed_AS:   326728 kB

$ sysctl vm.overcommit_{memory,ratio}
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
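
As a rough check on those numbers, the commit headroom is CommitLimit minus Committed_AS. The snippet below is only a sketch: commit_headroom_kb is a made-up helper name, it reads /proc/meminfo-style text on stdin, and the kernel enforces the limit only when vm.overcommit_memory is 2.

```shell
#!/bin/sh
# Sketch: print commit headroom (CommitLimit - Committed_AS) in kB.
# commit_headroom_kb is a hypothetical helper name, not an existing tool.
# It expects /proc/meminfo-formatted text on stdin.
commit_headroom_kb() {
    awk '/^CommitLimit:/  { limit = $2 }
         /^Committed_AS:/ { used = $2 }
         END { printf "%d\n", limit - used }'
}

# Usage on a live Linux box:
#   commit_headroom_kb < /proc/meminfo
```

Fed the two lines above, this prints 1755180, i.e. roughly 1.7GB of headroom, far more than the 90-110MB at which the server restarts.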

>How much swap do you have configured?

I have 512MB of swap configured, and its usage seems to vary between 0kB and 150kB. I think 512MB is sufficient, but I can increase it if necessary. After some experimentation last night, I was able to get the memory usage past 1GB by running some simple scripts like this in parallel:

set a "1" ; for { set i 0 } { $i < 300000000 } { incr i } { append a "11111111111111"}

The usage climbed to:
12199 nsadmin   17   0 1591m 1.0g 2984 S  200 34.8   3:37.78 nsd

and it seemed to be stable while running these scripts.

I am beginning to think it's less of a memory issue and more of a "something's not thread safe" issue (even though I've removed all my custom modules). I can't reproduce this problem with simulations, but as soon as I switch my real users to the new server, it begins restarting within a couple of minutes. The fact that it restarts at around 90-110MB could just be a coincidence, but I have no idea how to find the culprit. When I switched from 3.3+ad13 to 4.0.10, my server crashed all the time; I debugged nsd and found that ns_server was at fault, so I removed all references to it, and everything worked fine.

Now it's crashing in the Tcl interpreter, and I have no idea how to get from the crash back to the underlying Tcl code (and even if I could, it's probably not related to the actual cause of the issue, which is more likely heap corruption from a different thread).

Since all other things are equal between my FC4 and FC5 boxes, that might indicate a glibc issue or something similar. But I've found it difficult to link against a different glibc (since your binutils/gcc/kernel have to match closely, and upgrading glibc can break the rest of your system).

Some more info, if it helps: My server receives a medium-high amount of traffic, about 4 million requests/day, sometimes as high as 200 requests/second. I don't use ns_server anymore, but I do use [ns_info pageroot], the "source" command, and exec (to run ImageMagick, expect scripts, and other things, probably 1-2 execs/second). I get/set something in ns_cache on almost every request, although the caches themselves are relatively small (200-300k), and I believe I was able to reproduce the problem even with ns_cache disabled. I use PostgreSQL over TCP/IP, with 3 pools and 5 connections per pool (I tried Marc's maxidle/open suggestion with no success). And I don't use ADP at all, only Tcl and static content.

Thanks for all your help so far.

-Hossein
