On 10/19/2022 3:10 PM, Eric Wilkison wrote:
We've got a pool of servers running postfix. Each server is running bind to
cache DNS queries. We are running into an issue where DNS queries are
intermittently failing (the cause is out of scope for this discussion). When this
happens several times in a row, Postfix starts queueing ALL mail that would go to
the affected destination for exactly 5 minutes.
For example: bind, with query logging turned on, shows several of these logs:
Oct 19 11:53:12 hkglppfpool4 named[206415]: client @0x7f32b806b440
127.0.0.1#53827 (cluster9out.us.messagelabs.com): query failed (SERVFAIL) for
cluster9out.us.messagelabs.com/IN/A at ../../../bin/named/query.c:8580
At the same time Postfix logs:
Oct 19 11:53:12 hkglppfpool4 postfix/smtp[131030]: 4MspyQ3Fm6z511Sx:
to=<tengyilian1428...@126.com>, relay=none, delay=10, delays=0.14/0/10/0,
dsn=4.4.3, status=deferred (Host or domain name not found. Name service error for
name=cluster9out.us.messagelabs.com type=A: Host not found, try again)
When this happens, Postfix starts deferring ALL mail that should be delivered to
cluster9out.us.messagelabs.com for exactly 300 seconds. The named query logs
show no queries for this hostname during those 5 minutes, so Postfix is not even
attempting the lookup. After the 5 minutes are up, new messages routed
to cluster9out.us.messagelabs.com are delivered without being deferred and the
queued messages begin to go out.
Testing shows that the DNS issue is very short term, lasting for 1 second or
so. However, the pool of servers can handle a large number of messages in a
short time period. This combination amplifies the short-term DNS issue
into messages queueing for 5 minutes. We've seen the queues get
up over 1000 messages before the 5 minutes are up. Above is just one example.
We're seeing these delivery delays going to several different hosts.
The correct solution is to fix the underlying DNS issue. However, until then
we'd like to mitigate the consequences. Are there configuration options that
will:
a) adjust the number of DNS failures before Postfix starts deferring the
messages
b) adjust the timeout before Postfix stops queueing messages?
Thanks,
Eric Wilkison
Please see
http://www.postfix.org/QSHAPE_README.html
http://www.postfix.org/TUNING_README.html#hammer
With particular attention to the section:
http://www.postfix.org/QSHAPE_README.html#backlog
Setting up a custom transport for that destination with a high
destination concurrency limit and a high failed cohort limit (the
transport's _destination_concurrency_limit and
_destination_concurrency_failed_cohort_limit settings) will likely
reduce the pain of these temporary errors.
Unless this is the *only* destination, you probably shouldn't adjust the
queue run parameters.
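As a rough sketch of the custom-transport approach (an illustration, not a
tested configuration): the transport name "slowdns", the limits 50 and 100,
the file paths, and the recipient domain example.com are all placeholders
that do not come from this thread. The parameter names follow Postfix's
per-transport naming, transport_destination_concurrency_limit and
transport_destination_concurrency_failed_cohort_limit (the latter is
available in Postfix 2.5 and later).

```
# /etc/postfix/master.cf
# Hypothetical transport "slowdns": a copy of the stock "smtp" client entry.
# Copy the column values from your existing smtp line rather than from here.
slowdns   unix  -       -       n       -       -       smtp

# /etc/postfix/main.cf
# Illustrative values only; tune them for your own traffic.
slowdns_destination_concurrency_limit = 50
# Tolerate more failed cohorts before the destination is considered
# unavailable (the default is 1; 0 disables the feature).
slowdns_destination_concurrency_failed_cohort_limit = 100
transport_maps = hash:/etc/postfix/transport

# /etc/postfix/transport
# example.com stands in for whichever recipient domains route to this relay.
example.com    slowdns:cluster9out.us.messagelabs.com

# Afterwards: postmap /etc/postfix/transport && postfix reload
```

The observed 300-second pause appears to match the default
minimal_backoff_time (300s), which also limits how long an unreachable
destination stays in the queue manager's in-memory status cache. Lowering it
affects retry timing for every destination, which is presumably why it should
be left alone unless this is the only one.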
-- Noel Jones