[exim] Diagnosing delay in retrying

Paul Warren Thu, 01 Oct 2009 12:57:55 -0700

We have three servers.  One server generates a lot of mail and uses a  
pair of servers as a smart host.  The two servers are addressed under  
the same name (mx.mythic-beasts.com), so the config on the sending  
server looks like this:


smarthost:
   driver = manualroute
   route_data = mx.mythic-beasts.com
   transport = remote_smtp

where:

$ host mx.mythic-beasts.com
mx.mythic-beasts.com has address 93.93.131.52
mx.mythic-beasts.com has address 93.93.130.6

Every now and again, exim on the sending server decides that it can't
send mail and starts queuing it.  Looking at the logs, this appears
to be triggered by a connection timeout:

2009-09-29 20:26:59 1MsiIf-0002cW-Ps == [email protected] R=smarthost T=remote_smtp defer (110): Connection timed out

and that will then be followed by lots of non-retries:

2009-09-29 20:26:59 1MsiLb-0003f7-DW == [email protected] R=smarthost T=remote_smtp defer (-53): retry time not reached for any host

Exim then appears to refuse to retry for an unreasonably long period
of time.  For example, exim successfully sends a message at 20:54.  It
then hits a number of timeouts up to 20:58.  After that, it does not
appear to make another real delivery attempt until 04:57 the following
morning, despite logging
"defer (-53): retry time not reached for any host" many times every
minute for the whole of that period.
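
To measure a gap like this from the main log rather than by eye, something along the following lines could be used. It is only a sketch: it assumes each log entry is on a single line in the format quoted above, treats "defer (-53)" entries as skipped attempts (that code is taken from the log lines above), and counts "=>" / "->" deliveries via remote_smtp as real attempts. The function names are made up for illustration.

```python
import re
from datetime import datetime

TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
# Deferral line: "<timestamp> <message-id> == <address> ... defer (<code>): ..."
DEFER_RE = re.compile(rf"^{TS} \S+ == .* defer \((-?\d+)\)")
# Delivery line: "<timestamp> <message-id> => <address> ... T=remote_smtp ..."
DELIVERED_RE = re.compile(rf"^{TS} \S+ [=-]> .*\bT=remote_smtp\b")

# Code seen in the log lines above for "retry time not reached for any host".
RETRY_NOT_REACHED = -53


def _parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")


def real_attempt_times(lines):
    """Return sorted timestamps of entries that reflect an actual SMTP attempt.

    Deferrals with code -53 are ignored, because no connection was tried.
    """
    times = []
    for line in lines:
        m = DEFER_RE.match(line)
        if m:
            if int(m.group(2)) != RETRY_NOT_REACHED:
                times.append(_parse(m.group(1)))
            continue
        m = DELIVERED_RE.match(line)
        if m:
            times.append(_parse(m.group(1)))
    return sorted(times)


def longest_gap(times):
    """Return (start, end) of the longest interval between attempts, or None."""
    if len(times) < 2:
        return None
    return max(zip(times, times[1:]), key=lambda pair: pair[1] - pair[0])
```

Feeding it the lines of the main log (for example `open(path).read().splitlines()`) and printing `longest_gap(real_attempt_times(lines))` should show the start and end of the longest quiet period.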

Our retry configuration says:

begin retry

# Only retry bounce delivery once every 12 hours, for 4 days.
*                      *                senders=:           F,4d,12h

# Everything else, try once every 15 minutes for 12 hours, then once an hour,
# increasing by a factor of 1.5 each time, for 16 hours; then once every
# 8 hours for 4 days.
*                      *                                    F,12h,15m; G,16h,1h,1.5; F,4d,8h
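
For reference, here is a rough sketch of the schedule I would expect that last rule to produce. It is an idealized model, not Exim's actual code: it assumes every retry happens exactly when it falls due, that each cutoff is measured from the first failure, and that a G rule starts at its stated interval and multiplies by its factor after each attempt. The function names are invented for this example.

```python
from datetime import timedelta

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days", "w": "weeks"}


def parse_span(text):
    """Convert a time such as '15m', '12h' or '4d' to a timedelta."""
    return timedelta(**{_UNITS[text[-1]]: int(text[:-1])})


def parse_rules(spec):
    """Parse 'F,12h,15m; G,16h,1h,1.5' into (cutoff, interval, factor) tuples."""
    rules = []
    for part in spec.split(";"):
        fields = [f.strip() for f in part.split(",")]
        # F rules keep a fixed interval, modelled here as a factor of 1.0.
        factor = float(fields[3]) if fields[0] == "G" else 1.0
        rules.append((parse_span(fields[1]), parse_span(fields[2]), factor))
    return rules


def ideal_schedule(spec):
    """Return retry times as offsets from the first failure (idealized model)."""
    offsets = []
    elapsed = timedelta(0)
    for cutoff, interval, factor in parse_rules(spec):
        gap = interval
        # Stay on this rule until its cutoff (since first failure) has passed.
        while elapsed < cutoff:
            elapsed += gap
            offsets.append(elapsed)
            gap = gap * factor
    return offsets


if __name__ == "__main__":
    for offset in ideal_schedule("F,12h,15m; G,16h,1h,1.5; F,4d,8h"):
        print(offset)
```

In this model no gap between attempts is longer than 8 hours, and the 8-hour gaps only start once about 16h45m have passed since the first failure.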

A couple of questions:

1. Why doesn't it retry during that 8-hour period?  Surely the
successful send at 20:54 should reset the retry times?

2. Does setting route_data to a name with multiple A records achieve
the redundancy I'm looking for?  As far as I can tell, exim makes no
attempt to fall back to the second IP after the connection failure: it
hadn't seen a connection failure on the other IP for around 3 hours
prior to going into "won't send any mail" mode.

I'm separately trying to get to the bottom of why we're seeing the
connection timeouts in the first place, but I'd like to understand why
our setup isn't as robust as I think it should be.

many thanks,

Paul

-- 
## List details at http://lists.exim.org/mailman/listinfo/exim-users 
## Exim details at http://www.exim.org/
## Please use the Wiki with this list - http://wiki.exim.org/
