So, I'm trying to make a new build for our mailserver cluster, and I'm 
experiencing a strange problem.

This is FreeBSD 8.0, Postfix 2.6.6, Amavisd 2.6.4, Perl 5.10.1, SA 3.3.1, 
Clamav 0.96, both with p0f enabled and disabled.

In tuning the server, to find out the best value for for $max_servers (and 
the corresponding number of amavisfeed processes in master.cf), I 
discovered that if at any point the CPU becomes completely taxed (pegged 
at 100% CPU for a little bit), amavisd seems to enter a tail spin with all 
children processes blocking on select, introducing some increasing delay 
from then on (until it's restarted). Once amavised enters this weird 
state, it will proceed to take longer and longer to process messages 
eventually taking well over 40 seconds per message, even when ~95% of the 
CPU is unused and everything on the machine seems to be idling. 
amavisd-agent indicates this is not time spent in either AV or SA 
processing, as those continue to happen at the same rate. It's as if 
amavisd has some sort of backoff code or something, or perhaps there is 
some bug in the kernel. I'm currently running 8.0-RELEASE, but I'm 
cvsup'ing and compiling -STABLE as we speak to see if the problems persist 
in that train. 

Keep in mind that if I set the $max_servers sufficiently low (such as 10), 
it never enters this state, and properly processes mail as long as I ask 
it to. However, it processes SIGNIFICANTLY less mail than the hardware 
should.

It should be able to handle 20-30 max_servers without problem, and even 
then, increasing max_servers further should result in a levelling off or 
slight dip in performance, rather than a dramatic eradification of 
anything resembling performance.

Everything will be fine; amavisd is cranking (most of the AMD 64 X2 5200 
dual core CPU is being used), ram is available, amavisd-nanny will look 
healthy such as:

PID 42671: 42671-02-40   0:00:02 GS
PID 42672: 42672-02-44   0:00:01 =S
PID 42673: 42673-01-46   0:00:01 S
PID 42674: 42674-02-45   0:00:00 S
PID 42675: 42675-02-47   0:00:00 S
PID 42676: 42676-02-47   0:00:05 SSSSS
PID 42677: 42677-01-45   0:00:00 G
PID 42678: 42678-01-41   0:00:07 =SSSSSSS
PID 42679: 42679-01-47   0:00:00 S
PID 42680: 42680-01-50   0:00:00 S
PID 42681: 42681-01-39   0:00:01 =S
PID 42683: 42683-01-50   0:00:01 =S

And the processes will look like this in 'top' (notice they are all using 
significant amounts of the CPU, and most are actively in the RUN state):

43937 vscan       1  99    0 69532K 65716K RUN     1   0:03 13.28% perl
43940 vscan       1  73    0 69532K 66356K RUN     1   0:02 12.60% perl
43938 vscan       1  64    0 71580K 68180K RUN     0   0:02 11.87% perl
43939 vscan       1  76    0 70556K 67500K RUN     1   0:02 10.79% perl
43936 vscan       1  76    0 68508K 65344K RUN     0   0:02 10.25% perl
43932 vscan       1  65    0 68508K 65424K CPU0    0   0:02  9.77% perl
43931 vscan       1  76    0 74652K 71212K RUN     1   0:02  9.47% perl
43935 vscan       1  71    0 68508K 65364K RUN     0   0:02  9.47% perl
43934 vscan       1  76    0 68508K 64392K select  0   0:02  8.50% perl
43941 vscan       1  70    0 69532K 66676K select  0   0:02  8.40% perl
43933 vscan       1  97    0 77724K 74140K RUN     1   0:02  6.98% perl
43942 vscan       1  54    0 69532K 65384K select  0   0:01  6.79% perl

When things go wrong at some point (the only correlation I can find is a 
period of time where top reports cpu is 0% idle), amavisd-nanny looks like 
this (consistently long strings of '='):

PID 48416: 48416-03-4    0:00:38 =========:=========:=========:===r>
PID 48417: 48417-04-6    0:00:08 ===mmddd
PID 48418: 48418-03-8    0:00:39 =========:=========:=========:===Q>
PID 48419: 48419-02-18   0:00:36 =========:=========:=========:=QQQ>
PID 48420: 48420-02-18   0:00:27 =========:=========:===VVVVS
PID 48421: 48421-02-5    0:00:36 =========:=========:=========:=SSr>
PID 48422: 48422-02-5    0:00:36 =========:=========:=========:=SSr>
PID 48423: 48423-02-4    0:00:38 =========:=========:=========:===r>
PID 48424: 48424-02-6    0:00:30 =========:=========:======VVS:S
PID 48425: 48425-02-6    0:00:26 =========:=========:=VVVVV
PID 48426: .             0:00:45 .........:.........:.........:.....
PID 48427: 48427-03-3    0:00:38 =========:=========:=========:===r>
PID 48428: 48428-03-6    0:00:37 =========:=========:=========:==rr>
PID 48429: 48429-01-8    0:00:41 =========:=========:=========:===Q>
PID 48430: 48430-01-20   0:00:22 =========:=======GG:DD

And the processes will look like this in 'top' (notice they no longer use 
any real CPU and are stuck in "select"):

43935 vscan       1  51    0 74652K 71588K select  1   0:09  4.79% perl
43931 vscan       1  46    0 74652K 71720K select  0   0:09  2.88% perl
43937 vscan       1  46    0 75676K 72448K select  0   0:09  1.37% perl
43942 vscan       1  45    0 69532K 66720K select  0   0:07  1.27% perl
43936 vscan       1  44    0 71580K 68684K select  1   0:08  0.88% perl
43939 vscan       1  44    0 71580K 68924K select  0   0:08  0.29% perl
43934 vscan       1  44    0 71580K 67580K select  0   0:09  0.00% perl
43938 vscan       1  44    0 71580K 68736K select  1   0:09  0.00% perl
43940 vscan       1  44    0 69532K 66556K select  0   0:08  0.00% perl
43941 vscan       1  44    0 70556K 67032K select  0   0:08  0.00% perl
43932 vscan       1  44    0 69532K 66364K select  1   0:08  0.00% perl
43933 vscan       1  44    0 77724K 74544K select  1   0:07  0.00% perl

While this is happening, however, the box will show a load average of < 1, 
top shows plenty of memory and cpu is usually 95-97% idle. DB servers are 
not showing stress. But amavisd is mysteriously taking longer and longer 
to process messages.


Before things go wrong, the processing time looks like a very healthy:

TimeElapsedDecoding                     12 s      0.057 s/msg (InMsgs)
TimeElapsedPenPals                       0 s      0.001 s/msg (InMsgs)
TimeElapsedReceiving                    25 s      0.119 s/msg (InMsgs)
TimeElapsedSending                       3 s      0.015 s/msg (InMsgs)
TimeElapsedSpamCheck                   333 s      1.563 s/msg (InMsgs)
TimeElapsedTotal                       429 s      2.015 s/msg (InMsgs)
TimeElapsedVirusCheck                   19 s      0.088 s/msg (InMsgs)

After things go wrong, things get really weird. Here's the output of 
amavisd-agent after I let things continue to churn :

TimeElapsedDecoding                     12 s      0.048 s/msg (InMsgs)
TimeElapsedPenPals                      45 s      0.175 s/msg (InMsgs)
TimeElapsedReceiving                   525 s      2.050 s/msg (InMsgs)
TimeElapsedSending                     372 s      1.453 s/msg (InMsgs)
TimeElapsedSpamCheck                   479 s      1.870 s/msg (InMsgs)
TimeElapsedTotal                      2934 s     11.461 s/msg (InMsgs)
TimeElapsedVirusCheck                   51 s      0.201 s/msg (InMsgs)

If I let it keep going and going, TimeElapsedTotal continues to increase 
rapidly while everything except TimeElapsedReceiving stays relatively 
constant (in terms of s/msg). TimeElapsedReceiving and TimeElapsedSending 
do increase slightly over time, while TimeElapsedTotal increases very 
rapidly over time. In the time it took to write this paragraph we now 
have:

TimeElapsedDecoding                     14 s      0.044 s/msg (InMsgs)
TimeElapsedPenPals                      57 s      0.187 s/msg (InMsgs)
TimeElapsedReceiving                   867 s      2.843 s/msg (InMsgs)
TimeElapsedSending                     548 s      1.796 s/msg (InMsgs)
TimeElapsedSpamCheck                   543 s      1.779 s/msg (InMsgs)
TimeElapsedTotal                      4541 s     14.890 s/msg (InMsgs)
TimeElapsedVirusCheck                   53 s      0.173 s/msg (InMsgs)

And a couple of minutes later:

TimeElapsedDecoding                     16 s      0.039 s/msg (InMsgs)
TimeElapsedPenPals                      88 s      0.219 s/msg (InMsgs)
TimeElapsedReceiving                  1354 s      3.384 s/msg (InMsgs)
TimeElapsedSending                     852 s      2.129 s/msg (InMsgs)
TimeElapsedSpamCheck                   670 s      1.676 s/msg (InMsgs)
TimeElapsedTotal                      7014 s     17.534 s/msg (InMsgs)
TimeElapsedVirusCheck                   57 s      0.142 s/msg (InMsgs)

This will continue to increase as long as I let it. However, if I merely 
restart amavisd it will process rapidly for perhaps a minute, and then 
start degrading. The closer I set the $max_servers to 10, the longer it 
operates normally until it falls apart. If I leave it at 10, it seems to 
run forever just fine...just not processing mail as rapidly as it should. 
Much older hardware (running our older versions of all the software) can 
handle many more max_servers and messages per second.


Does anybody have any idea what is going on here? I'm happy to try any 
suggested debugging.

Thanks,
Andy

---
Andy Dills
Xecunet, Inc.
www.xecu.net
301-682-9972
---

------------------------------------------------------------------------------

_______________________________________________
AMaViS-user mailing list
AMaViS-user@lists.sourceforge.net 
https://lists.sourceforge.net/lists/listinfo/amavis-user 
 AMaViS-FAQ:http://www.amavis.org/amavis-faq.php3 
 AMaViS-HowTos:http://www.amavis.org/howto/ 

Reply via email to