Hi all, about two weeks ago two of my servers with the heaviest load suddenly started taking 30-40 minutes to process some messages.
I am running OpenBSD 7.1 on two 8-core Xeon D-1541 @ 2.10GHz machines with software RAID 1 on two SSDs and the standard packages:

    amavisd-new-2.12.0p0
    postfix-3.5.14
    clamav-0.104.4

The setup is the standard one: Postfix feeds mail into Amavis on port 10024 and reads it back on port 10025, as per the Amavis documentation.

/etc/amavisd.conf:

    $max_servers = 10;  # num of pre-forked children (2..30 is common), -m

/etc/postfix/master.cf:

    amavisfeed unix - - n - 10 lmtp
      -o lmtp_data_done_timeout=120
      -o lmtp_send_xforward_command=yes
      -o lmtp_tls_note_starttls_offer=no
      -o disable_dns_lookups=yes
      -o max_use=20

/etc/postfix/main.cf:

    # amavisd-new setup using separate Postfix instance
    content_filter=amavisfeed:[127.0.0.1]:10024
    # Concurrency limit *MUST* match master.cf
    amavisfeed_destination_concurrency_limit = 10

The systems run their own caching resolver (unbound).

The symptom is that the perl process associated with one of the 10 servers suddenly pins a core at 100% and takes 30-40 minutes to return (it does return if left alone, so this isn’t a case of a hung process).

While attempting to isolate the problem I turned off DKIM signature verification ($enable_dkim_verification = 0;), as it seemed that all the problematic emails had DKIM signatures, but this did not alleviate the issue. For example, I’d see entries like:

    dkim_sd=20200929:example.net, 300330 ms

but then an _identical_ email (sent to a different address) would have:

    dkim_sd=20200929:example.net, 8746 ms

which is a far more reasonable time.

After turning off DKIM verification I then tried reducing the lifetime of the amavis processes with:

    $max_requests = 5;  # num of requests before we reap a child

I’ve seen entries of up to about 30 minutes (i.e. 1738400 ms).

The only thing that comes to mind is that I update SpamAssassin automatically every night using sa-update and, perhaps, a SpamAssassin update now includes a test which suddenly takes a very long time.

NOTE: all the other perl processes (i.e. 9/10) continue processing email quickly and efficiently without any problem whatsoever.

I was wondering if anyone is seeing similar behaviour or has any recommendations for debugging this further. Currently, I am sorry to admit, I have set up a job which kills the relevant perl process if it has been hogging the CPU for longer than 5 minutes… yes, it is a horrible hack, but it keeps mail flowing…

Cheers,
Arrigo
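
For anyone wanting to reproduce the stopgap, here is a minimal sketch of the selection logic such a watchdog job might use. It is not the actual job running on these servers. It assumes the input is text with three columns (PID, accumulated CPU time, command), that the CPU time looks like MM:SS.ss, HH:MM:SS or DD-HH:MM:SS, and that the amavis children can be recognised by "amavisd" in the command column. The function names, the pattern and the sample lines are all hypothetical. Accumulated CPU time is only a proxy for "hogging the CPU for longer than 5 minutes", and a long-lived parent process could also exceed the limit, so the pattern may need tightening.

```python
import re


def cpu_seconds(field):
    """Convert a CPU-time field such as '31:07.45' or '1-02:03:04' to seconds.

    The accepted formats are an assumption about what ps prints.
    """
    days = 0
    if "-" in field:
        day_part, field = field.split("-", 1)
        days = int(day_part)
    seconds = 0.0
    # Fold the colon-separated parts left to right: ((H * 60) + M) * 60 + S
    for part in field.split(":"):
        seconds = seconds * 60 + float(part)
    return days * 86400 + seconds


def runaway_pids(ps_output, limit_seconds=300, name_pattern=r"amavisd"):
    """Return PIDs whose command matches name_pattern and whose
    accumulated CPU time exceeds limit_seconds (default: 5 minutes)."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)  # pid, cpu time, rest of the command
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip the header line and anything malformed
        pid, cputime, command = fields
        if re.search(name_pattern, command) and cpu_seconds(cputime) > limit_seconds:
            pids.append(int(pid))
    return pids


if __name__ == "__main__":
    # Made-up sample in the assumed three-column layout.
    sample = """\
  PID       TIME COMMAND
 1234   31:07.45 perl: amavisd (child busy)
 1235    0:02.10 perl: amavisd (child idle)
 1300   45:00.00 /usr/sbin/unbound
"""
    for pid in runaway_pids(sample):
        print("would kill", pid)
```

The sketch only reports candidates; the caller decides what to do with them. On a live system the input could come from something like `ps -axo pid,time,command`, but check the column formats on your own machine before trusting the parser.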