Hi there list,

this week we stumbled upon an issue where we could not send mail to certain domains, for instance em...@umcg.nl.

Nov 16 17:04:08 mail postfix/smtp[13330]: warning: no MX host for umcg.nl has a valid address record
Nov 16 17:04:08 mail postfix/smtp[13330]: 1D1D21422C2: to=<em...@umcg.nl>, relay=none, delay=2257, delays=2256/0.02/0.52/0, dsn=4.4.3, status=deferred (Host or domain name not found. Name service error for name=umcg-nl.mail.protection.outlook.com type=A: Host not found, try again)

It turned out that this was the cause:

  $ dig MX umcg.nl +short
  10 umcg-nl.mail.protection.outlook.com.

  $ dig NS mail.protection.outlook.com. +short
  ns1-proddns.glbdns.o365filtering.com.
  ns2-proddns.glbdns.o365filtering.com.

  $ dig A umcg-nl.mail.protection.outlook.com.  \
      @ns1-proddns.glbdns.o365filtering.com. +edns +dnssec |
    grep FORMERR
  ;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 46904
  ;; WARNING: EDNS query returned status FORMERR -
      retry with '+nodnssec +noedns'


Apparently the authoritative name servers for some Microsoft Office 365 mail hosts do not support EDNS and answer EDNS queries with FORMERR. Our DNS recursors passed this on as SERVFAIL, which caused the lookup to fail.

A temporary workaround was to preheat the DNS cache by manually querying said domain without EDNS and then flush the queue entries:

  $ dig A umcg-nl.mail.protection.outlook.com. \
      @ns1-proddns.glbdns.o365filtering.com. +noedns +nodnssec +short
  213.199.154.87
  213.199.154.23

  # postqueue -i THE_ITEM

But that's obviously not the right solution.


Some more digging revealed that EDNS was enabled on the query through `smtp_addr_list`:

     else if (smtp_tls_insecure_mx_policy > TLS_LEV_MAY)
        res_opt = RES_USE_DNSSEC;

Setting RES_USE_DNSSEC causes the subsequent queries to use EDNS0 (RES_USE_EDNS0) with the DO flag set, and that killed our interoperability with the Microsoft Office 365 DNS.
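To make the difference on the wire concrete, here is a minimal sketch that builds the same A query with and without an EDNS0 OPT record carrying the DO flag. It is not Postfix or resolver code; `build_query` and `encode_name` are hypothetical helpers written for this illustration, and the 4096-byte UDP payload size is an assumed value.

```python
import struct

TYPE_A = 1
TYPE_OPT = 41
CLASS_IN = 1
DO_FLAG = 0x8000  # "DNSSEC OK" bit in the flags half of the OPT TTL field


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, query_id: int = 0x1234, edns: bool = False,
                do: bool = False, udp_size: int = 4096) -> bytes:
    """Build a DNS A query; with edns=True an OPT pseudo-record is appended."""
    arcount = 1 if edns else 0
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, arcount)
    question = encode_name(name) + struct.pack("!HH", TYPE_A, CLASS_IN)
    if not edns:
        return header + question
    # OPT record: root name, TYPE=OPT, CLASS=UDP payload size,
    # TTL = extended RCODE (8 bits) | version (8 bits) | flags (16 bits),
    # RDLENGTH=0 (no options)
    ttl = DO_FLAG if do else 0
    opt = b"\x00" + struct.pack("!HHIH", TYPE_OPT, udp_size, ttl, 0)
    return header + question + opt


plain = build_query("umcg-nl.mail.protection.outlook.com.")
with_do = build_query("umcg-nl.mail.protection.outlook.com.",
                      edns=True, do=True)
```

The only difference between `plain` and `with_do` is the additional-record count in the header and the 11 trailing OPT bytes. A server that does not understand that extra record is the one that answers FORMERR.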

The fix was then to lower `smtp_tls_dane_insecure_mx_policy` (the `smtp_tls_insecure_mx_policy` variable in the code above) from 5 (dane) to 1 (may) in main.cf:

    # default: dane
    smtp_tls_dane_insecure_mx_policy = may


For the record, according to the logs this miscommunication started on our servers on the 2nd of November (although I cannot rule out that something changed on our side). We are running postfix 3.1.0-3 (Ubuntu Xenial) here.


My questions -- finally:

- Apart from Microsoft upgrading their servers to 2016 and supporting EDNS, is this issue something postfix should handle?

- Would postfix have handled FORMERR but not SERVFAIL and are my caching resolvers to blame?

- Should postfix retry the query without EDNS on unexpected errors?

- Should the default smtp_tls_dane_insecure_mx_policy be 'dane'? Or would something more conservative be appropriate, given that it can cause this kind of miscommunication?
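
To illustrate what the retry question is asking for, here is a minimal sketch of an EDNS fallback. It is not how Postfix or any particular resolver behaves; `lookup_with_fallback` and the `resolve(name, use_edns)` callable are hypothetical names made up for this example, and treating both FORMERR and SERVFAIL as retry triggers is an assumption.

```python
from typing import Callable, List, Tuple

# Standard DNS RCODE values
NOERROR, FORMERR, SERVFAIL = 0, 1, 2

# Hypothetical resolver signature: (name, use_edns) -> (rcode, addresses)
Resolver = Callable[[str, bool], Tuple[int, List[str]]]


def lookup_with_fallback(resolve: Resolver,
                         name: str) -> Tuple[int, List[str], bool]:
    """Query with EDNS first; on FORMERR or SERVFAIL retry once without it.

    Returns (rcode, addresses, edns_used). When edns_used is False the
    query carried no DO flag, so the caller should treat the answer as
    not DNSSEC-validated.
    """
    rcode, addrs = resolve(name, True)
    if rcode in (FORMERR, SERVFAIL):
        rcode, addrs = resolve(name, False)
        return rcode, addrs, False
    return rcode, addrs, True


def fake_resolver(name: str, use_edns: bool) -> Tuple[int, List[str]]:
    """Stand-in that mimics the failure described above; no network access."""
    if use_edns:
        return SERVFAIL, []
    return NOERROR, ["213.199.154.87", "213.199.154.23"]


result = lookup_with_fallback(fake_resolver,
                              "umcg-nl.mail.protection.outlook.com.")
```

The trade-off is that a plain retry gives up the DNSSEC information the first query asked for, which matters when the policy is 'dane'. That is why the sketch reports whether EDNS was actually used.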



Thanks for your input.

Cheers,
Walter Doekes
OSSO B.V.
