Howdy! I'm trying to work out some tuning for a cluster of Postfix servers behind a load balancer. The load balancer does a simple round-robin across 4 nodes with direct TCP passthrough and does not mangle the traffic in any way. We are currently running the RHEL/CentOS 7 packaged Postfix 2.10.

This cluster receives mail from an external Proofpoint cloud service, processes it, and passes it on for final delivery. We use an external cloud service for emergency notifications and are testing a new process to send out notification email to on the order of 81,000+ addresses in just a few minutes.

What we ran into is that Proofpoint seemed to stay connected to one of the four nodes for about 3 minutes, sending about 60 batches of addresses through a single smtpd process. Somewhere in that window, the node started telling Proofpoint to back off and deferred around 70,000 messages from the overall batch; initially only around 7,000 messages were submitted and delivered. Throughout the day, the deferred messages continued to deliver, about 10-20 at a time, until at some point overnight all were finally delivered. Needless to say, our emergency management people are unhappy with these results.

I've been looking at tuning options in Postfix to try to accomplish two things:

1) force Proofpoint to disconnect and reconnect, in the hope of spreading the load over all four nodes, and
2) allow Postfix to accept more messages in a short time when these bursts hit (once a month for testing, and as needed throughout the year).

The setting I'm mostly looking at is smtpd_client_connection_rate_limit, which is currently at the default of 0. For reasons unclear, smtpd_client_connection_count_limit was raised from the default of 50 to 1000 several years ago, but default_process_limit was not increased; we probably need to tune those a bit.

Are there any recommendations that could improve our throughput during these message bursts? Normal day-to-day traffic flows without issue with our current configuration.

I'd appreciate any advice. Note that I will likely not be able to test changes against a large mailing until the end of March, in the next scheduled emergency notification test window.

Thanks,
RobertC
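
P.S. To put the numbers above in perspective, here is a rough back-of-the-envelope sketch in Python. It is only an estimate under some assumptions of mine: one message per address, the initial ~7,000 messages all accepted within the ~3 minute window, and a perfectly even spread across the four nodes. The variable names are just for illustration.

```python
# Rough burst arithmetic from the approximate figures in this post.
total_addresses = 81_000      # size of the notification mailing (assumed 1 message each)
accepted_initially = 7_000    # messages accepted and delivered during the burst
burst_seconds = 3 * 60        # Proofpoint stayed on one node for about 3 minutes
nodes = 4                     # nodes behind the round-robin load balancer

# Rate the single node appears to have sustained during the burst
# (assumes all 7,000 were accepted inside the 3 minute window).
accepted_per_second = accepted_initially / burst_seconds

# Rate needed to take the whole mailing in the same window, overall and
# per node if the load were spread evenly (an idealized assumption).
needed_per_second = total_addresses / burst_seconds
needed_per_node = needed_per_second / nodes

print(f"accepted on one node: {accepted_per_second:.1f} msg/s")
print(f"needed overall:       {needed_per_second:.1f} msg/s")
print(f"needed per node:      {needed_per_node:.1f} msg/s")
```

If those assumptions are anywhere near right, even a perfect spread over four nodes would need each node to accept roughly three times what the one node managed, so I suspect both goals above matter and not just the rebalancing.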