So, I'm trying to make a new build for our mailserver cluster, and I'm experiencing a strange problem.
This is FreeBSD 8.0, Postfix 2.6.6, Amavisd 2.6.4, Perl 5.10.1, SA 3.3.1, Clamav 0.96, both with p0f enabled and disabled. In tuning the server, to find out the best value for for $max_servers (and the corresponding number of amavisfeed processes in master.cf), I discovered that if at any point the CPU becomes completely taxed (pegged at 100% CPU for a little bit), amavisd seems to enter a tail spin with all children processes blocking on select, introducing some increasing delay from then on (until it's restarted). Once amavised enters this weird state, it will proceed to take longer and longer to process messages eventually taking well over 40 seconds per message, even when ~95% of the CPU is unused and everything on the machine seems to be idling. amavisd-agent indicates this is not time spent in either AV or SA processing, as those continue to happen at the same rate. It's as if amavisd has some sort of backoff code or something, or perhaps there is some bug in the kernel. I'm currently running 8.0-RELEASE, but I'm cvsup'ing and compiling -STABLE as we speak to see if the problems persist in that train. Keep in mind that if I set the $max_servers sufficiently low (such as 10), it never enters this state, and properly processes mail as long as I ask it to. However, it processes SIGNIFICANTLY less mail than the hardware should. It should be able to handle 20-30 max_servers without problem, and even then, increasing max_servers further should result in a levelling off or slight dip in performance, rather than a dramatic eradification of anything resembling performance. Everything will be fine; amavisd is cranking (most of the AMD 64 X2 5200 dual core CPU is being used), ram is available, amavisd-nanny will look healthy such as: PID 42671: 42671-02-40 0:00:02 GS PID 42672: 42672-02-44 0:00:01 =S PID 42673: 42673-01-46 0:00:01 S PID 42674: 42674-02-45 0:00:00 S PID 42675: 42675-02-47 0:00:00 S PID 42676: 42676-02-47 0:00:05 SSSSS PID 42677: 42677-01-45 0:00:00 G PID 42678: 42678-01-41 0:00:07 =SSSSSSS PID 42679: 42679-01-47 0:00:00 S PID 42680: 42680-01-50 0:00:00 S PID 42681: 42681-01-39 0:00:01 =S PID 42683: 42683-01-50 0:00:01 =S And the processes will look like this in 'top' (notice they are all using significant amounts of the CPU, and most are actively in the RUN state): 43937 vscan 1 99 0 69532K 65716K RUN 1 0:03 13.28% perl 43940 vscan 1 73 0 69532K 66356K RUN 1 0:02 12.60% perl 43938 vscan 1 64 0 71580K 68180K RUN 0 0:02 11.87% perl 43939 vscan 1 76 0 70556K 67500K RUN 1 0:02 10.79% perl 43936 vscan 1 76 0 68508K 65344K RUN 0 0:02 10.25% perl 43932 vscan 1 65 0 68508K 65424K CPU0 0 0:02 9.77% perl 43931 vscan 1 76 0 74652K 71212K RUN 1 0:02 9.47% perl 43935 vscan 1 71 0 68508K 65364K RUN 0 0:02 9.47% perl 43934 vscan 1 76 0 68508K 64392K select 0 0:02 8.50% perl 43941 vscan 1 70 0 69532K 66676K select 0 0:02 8.40% perl 43933 vscan 1 97 0 77724K 74140K RUN 1 0:02 6.98% perl 43942 vscan 1 54 0 69532K 65384K select 0 0:01 6.79% perl When things go wrong at some point (the only correlation I can find is a period of time where top reports cpu is 0% idle), amavisd-nanny looks like this (consistently long strings of '='): PID 48416: 48416-03-4 0:00:38 =========:=========:=========:===r> PID 48417: 48417-04-6 0:00:08 ===mmddd PID 48418: 48418-03-8 0:00:39 =========:=========:=========:===Q> PID 48419: 48419-02-18 0:00:36 =========:=========:=========:=QQQ> PID 48420: 48420-02-18 0:00:27 =========:=========:===VVVVS PID 48421: 48421-02-5 0:00:36 =========:=========:=========:=SSr> PID 48422: 48422-02-5 0:00:36 =========:=========:=========:=SSr> PID 48423: 48423-02-4 0:00:38 =========:=========:=========:===r> PID 48424: 48424-02-6 0:00:30 =========:=========:======VVS:S PID 48425: 48425-02-6 0:00:26 =========:=========:=VVVVV PID 48426: . 0:00:45 .........:.........:.........:..... PID 48427: 48427-03-3 0:00:38 =========:=========:=========:===r> PID 48428: 48428-03-6 0:00:37 =========:=========:=========:==rr> PID 48429: 48429-01-8 0:00:41 =========:=========:=========:===Q> PID 48430: 48430-01-20 0:00:22 =========:=======GG:DD And the processes will look like this in 'top' (notice they no longer use any real CPU and are stuck in "select"): 43935 vscan 1 51 0 74652K 71588K select 1 0:09 4.79% perl 43931 vscan 1 46 0 74652K 71720K select 0 0:09 2.88% perl 43937 vscan 1 46 0 75676K 72448K select 0 0:09 1.37% perl 43942 vscan 1 45 0 69532K 66720K select 0 0:07 1.27% perl 43936 vscan 1 44 0 71580K 68684K select 1 0:08 0.88% perl 43939 vscan 1 44 0 71580K 68924K select 0 0:08 0.29% perl 43934 vscan 1 44 0 71580K 67580K select 0 0:09 0.00% perl 43938 vscan 1 44 0 71580K 68736K select 1 0:09 0.00% perl 43940 vscan 1 44 0 69532K 66556K select 0 0:08 0.00% perl 43941 vscan 1 44 0 70556K 67032K select 0 0:08 0.00% perl 43932 vscan 1 44 0 69532K 66364K select 1 0:08 0.00% perl 43933 vscan 1 44 0 77724K 74544K select 1 0:07 0.00% perl While this is happening, however, the box will show a load average of < 1, top shows plenty of memory and cpu is usually 95-97% idle. DB servers are not showing stress. But amavisd is mysteriously taking longer and longer to process messages. Before things go wrong, the processing time looks like a very healthy: TimeElapsedDecoding 12 s 0.057 s/msg (InMsgs) TimeElapsedPenPals 0 s 0.001 s/msg (InMsgs) TimeElapsedReceiving 25 s 0.119 s/msg (InMsgs) TimeElapsedSending 3 s 0.015 s/msg (InMsgs) TimeElapsedSpamCheck 333 s 1.563 s/msg (InMsgs) TimeElapsedTotal 429 s 2.015 s/msg (InMsgs) TimeElapsedVirusCheck 19 s 0.088 s/msg (InMsgs) After things go wrong, things get really weird. Here's the output of amavisd-agent after I let things continue to churn : TimeElapsedDecoding 12 s 0.048 s/msg (InMsgs) TimeElapsedPenPals 45 s 0.175 s/msg (InMsgs) TimeElapsedReceiving 525 s 2.050 s/msg (InMsgs) TimeElapsedSending 372 s 1.453 s/msg (InMsgs) TimeElapsedSpamCheck 479 s 1.870 s/msg (InMsgs) TimeElapsedTotal 2934 s 11.461 s/msg (InMsgs) TimeElapsedVirusCheck 51 s 0.201 s/msg (InMsgs) If I let it keep going and going, TimeElapsedTotal continues to increase rapidly while everything except TimeElapsedReceiving stays relatively constant (in terms of s/msg). TimeElapsedReceiving and TimeElapsedSending do increase slightly over time, while TimeElapsedTotal increases very rapidly over time. In the time it took to write this paragraph we now have: TimeElapsedDecoding 14 s 0.044 s/msg (InMsgs) TimeElapsedPenPals 57 s 0.187 s/msg (InMsgs) TimeElapsedReceiving 867 s 2.843 s/msg (InMsgs) TimeElapsedSending 548 s 1.796 s/msg (InMsgs) TimeElapsedSpamCheck 543 s 1.779 s/msg (InMsgs) TimeElapsedTotal 4541 s 14.890 s/msg (InMsgs) TimeElapsedVirusCheck 53 s 0.173 s/msg (InMsgs) And a couple of minutes later: TimeElapsedDecoding 16 s 0.039 s/msg (InMsgs) TimeElapsedPenPals 88 s 0.219 s/msg (InMsgs) TimeElapsedReceiving 1354 s 3.384 s/msg (InMsgs) TimeElapsedSending 852 s 2.129 s/msg (InMsgs) TimeElapsedSpamCheck 670 s 1.676 s/msg (InMsgs) TimeElapsedTotal 7014 s 17.534 s/msg (InMsgs) TimeElapsedVirusCheck 57 s 0.142 s/msg (InMsgs) This will continue to increase as long as I let it. However, if I merely restart amavisd it will process rapidly for perhaps a minute, and then start degrading. The closer I set the $max_servers to 10, the longer it operates normally until it falls apart. If I leave it at 10, it seems to run forever just fine...just not processing mail as rapidly as it should. Much older hardware (running our older versions of all the software) can handle many more max_servers and messages per second. Does anybody have any idea what is going on here? I'm happy to try any suggested debugging. Thanks, Andy --- Andy Dills Xecunet, Inc. www.xecu.net 301-682-9972 --- ------------------------------------------------------------------------------ _______________________________________________ AMaViS-user mailing list AMaViS-user@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/amavis-user AMaViS-FAQ:http://www.amavis.org/amavis-faq.php3 AMaViS-HowTos:http://www.amavis.org/howto/