I think that the source of the problem here is

> ;; Truncated, retrying in TCP mode.


Dnsmasq is forwarding the query via UDP and getting a reply, but the
reply has the TC (truncated) bit set, which says "the reply is too big
for your UDP packet, try again using TCP."
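
For reference, the truncation flag is a single bit in the flags word of
the 12-byte DNS header. Below is a minimal sketch of how a client could
test for it. The function name and the sample header are made up for
illustration and are not taken from dnsmasq or dig.

```python
import struct

TC_MASK = 0x0200  # TC (truncated) bit in the 16-bit DNS header flags word


def is_truncated(message: bytes) -> bool:
    """Return True if the TC bit is set in a raw DNS message."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes; message too short")
    # The header starts with the 16-bit ID followed by the 16-bit flags.
    _query_id, flags = struct.unpack("!HH", message[:4])
    return bool(flags & TC_MASK)


# Hypothetical reply header: QR (0x8000) + AA (0x0400) + TC (0x0200),
# one question and no records, as a truncated UDP reply might look.
truncated_header = struct.pack("!HHHHHH", 14584, 0x8000 | 0x0400 | 0x0200, 1, 0, 0, 0)

# Same header without TC, as seen in a complete reply.
complete_header = struct.pack("!HHHHHH", 14584, 0x8000 | 0x0400, 1, 5, 0, 5)
```

A resolver that sees is_truncated() return True on a UDP reply is
expected to repeat the same question over TCP, which is what dig
reports as "Truncated, retrying in TCP mode."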

That gets returned to your requester, which does the right thing and
sends the query again to dnsmasq via TCP. Dnsmasq attempts to connect to
the first upstream server by TCP, but that server is down. By the time
the kernel times out and dnsmasq tries to connect to the second server,
the original requester has already timed out and given up.

The best fix for this is probably to tackle the too-big-reply problem.
The replies you are getting are 528 bytes, which is just bigger than the
512-byte minimum UDP message size which must be supported by any DNS
server. You can either remove one of the five SRV records, or change the
server config to support EDNS0 and >512 byte replies.

Note that it's possible that the upstream server is truncating not
because of its config but because the _queries_ you are sending don't
have EDNS0 specifying a larger than 512 byte allowable reply. Modern
"dig" implementations normally do this by default, but yours doesn't
seem to. Of course, fixing dig doesn't fix the problem if the "real"
queries, rather than the test ones from dig, have the same problem.
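
To make the EDNS0 point concrete, here is a rough sketch of a query that
advertises a larger UDP payload by appending an OPT pseudo-record to the
additional section. The helper names, the query ID and the 4096-byte
size are arbitrary choices for illustration; this is not how dnsmasq or
dig build their packets.

```python
import struct

TYPE_SRV = 33  # SRV record type
TYPE_OPT = 41  # EDNS0 OPT pseudo-record type
CLASS_IN = 1


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, qtype: int = TYPE_SRV,
                udp_size: int = 4096, query_id: int = 0x1234) -> bytes:
    """Build a DNS query with an EDNS0 OPT record advertising udp_size."""
    # Header: ID, flags (RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=1
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 1)
    question = encode_name(name) + struct.pack("!HH", qtype, CLASS_IN)
    # OPT record: root name, TYPE=OPT, CLASS carries the UDP payload size,
    # TTL carries extended RCODE/version/flags (all zero), RDLENGTH=0.
    opt = b"\x00" + struct.pack("!HHIH", TYPE_OPT, udp_size, 0, 0)
    return header + question + opt


query = build_query("_sip._udp.scscf.sprout.lp.labdomain.net")
```

With dig, the "+bufsize=4096" option should have a similar effect on the
test queries, which is one way to check whether the upstream server
stops truncating once it is told that a larger reply is acceptable.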

If none of that works, you are into sysctl hacking to reduce the
timeout on TCP connection setup.
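
As a rough guide to what that timeout looks like, the sketch below adds
up the waits for an unanswered TCP connection attempt. It assumes the
relevant knob is Linux's net.ipv4.tcp_syn_retries, that the first SYN
waits one second, and that each retransmission doubles the wait. Both
the knob and the timings are assumptions to check against your kernel,
and the function name is made up.

```python
def syn_connect_timeout(syn_retries: int, initial_rto: float = 1.0) -> float:
    """Seconds before connect() gives up on a host that never answers.

    Assumes the initial SYN plus `syn_retries` retransmissions, with the
    wait after each one doubling from `initial_rto` (an assumed model).
    """
    if syn_retries < 0:
        raise ValueError("syn_retries must be non-negative")
    return sum(initial_rto * 2 ** attempt for attempt in range(syn_retries + 1))


# Total wait for a range of retry settings under the assumptions above.
timeouts = {retries: syn_connect_timeout(retries) for retries in range(7)}
```

Under these assumptions six retries add up to about two minutes, while
one or two retries come to 3 or 7 seconds. The quoted log shows the
client retrying roughly every 10 seconds, so the connection attempt to
the dead server would need to fail faster than that for dnsmasq to
reach the second server in time.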


Cheers,

Simon.



On 14/08/18 18:05, Warner, Andrew C [CTO] wrote:
> Subject: DNSMASQ failing to return SRV records with loss of
> communication to a single DNS server
> 
>  
> 
> Issue:  We have SIP SRV records for a domain which can be provided by
> two DNS servers in our environment.  During testing we have noticed that
> if one of the DNS servers is un-reachable, the request for the SRV
> records via dnsmasq times out.
> 
>  
> 
> This only happens when the query is originated from outside the box
> where dnsmasq is running.  IE – if we issue the SRV query from the
> dnsmasq server, the SRV records are returned.  If we issue the request
> from a client VM which is set to resolve queries against our dnsmasq
> host – the request times out.
> 
>  
> 
> Note:  some of the information below has been changed/replaced with xxx,
>  such as IP addresses and domain names for security reasons.
> 
>  
> 
> Dnsmasq.conf has the following entries – indicating to forward requests
> for labdomain.net to 10.xx.xx.12 and 10.xx.xx.20.   
> 
> server=/labdomain.net/10.xx.xx.12
> 
> server=/labdomain.net/10.xx.xx.20
> 
>  
> 
> VM making SRV queries is 10.xx.xx.99
> 
>  
> 
>  
> 
> *When we query for an SRV record with 10.xx.xx.5 being our DNSMASQ
> server, and have commented out the non-reachable DNS server: 10.xx.xx.12
> – we receive a response to the SRV query.*
> 
>  
> 
> #server=/labdomain.net/10.xx.xx.12
> 
> server=/labdomain.net/10.xx.xx.20
> 
>  
> 
>  
> 
> [labuser@f5-test ~]$ dig srv _sip._udp.scscf.sprout.lp.labdomain.net
> @10.xx.xx.5
> 
> ;; Truncated, retrying in TCP mode.
> 
>  
> 
> ; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.5 <<>> srv
> _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5
> 
> ;; global options: +cmd
> 
> ;; Got answer:
> 
> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14584
> 
> ;; flags: qr aa; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 5
> 
>  
> 
> ;; QUESTION SECTION:
> 
> ;_sip._udp.scscf.sprout.lp.labdomain.net. IN SRV
> 
>  
> 
> ;; ANSWER SECTION:
> 
> _sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054
> ovpklp-viscscf-spn-05.labdomain.net.
> 
> _sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054
> ovpklp-viscscf-spn-01.labdomain.net.
> 
> _sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054
> ovpklp-viscscf-spn-02.labdomain.net.
> 
> _sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054
> ovpklp-viscscf-spn-03.labdomain.net.
> 
> _sip._udp.scscf.sprout.lp.labdomain.net. 15 IN SRV 10 50 5054
> ovpklp-viscscf-spn-04.labdomain.net.
> 
>  
> 
> ;; ADDITIONAL SECTION:
> 
> ovpklp-viscscf-spn-05.labdomain.net. 43200 IN A 10.xx.xx.18
> 
> ovpklp-viscscf-spn-01.labdomain.net. 43200 IN A 10.xx.xx.14
> 
> ovpklp-viscscf-spn-02.labdomain.net. 43200 IN A 10.xx.xx.15
> 
> ovpklp-viscscf-spn-03.labdomain.net. 43200 IN A 10.xx.xx.16
> 
> ovpklp-viscscf-spn-04.labdomain.net. 43200 IN A 10.xx.xx.17
> 
>  
> 
> ;; Query time: 2 msec
> 
> ;; SERVER: 10.xx.xx.5#53(10.xx.xx.5)
> 
> ;; WHEN: Mon Aug 13 16:34:40 2018
> 
> ;; MSG SIZE  rcvd: 528
> 
>  
> 
>  
> 
>  
> 
> *When we query for an SRV record with 10.xx.xx.5 being our DNSMASQ
> server, and have both the good and non-reachable DNS server in play – we
> receive a timeout to the SRV query.  In this case – 10.xx.xx.20 is fully
> capable of responding to the SRV query.*
> 
>  
> 
> server=/labdomain.net/10.xx.xx.12        <- not reachable
> 
> server=/labdomain.net/10.xx.xx.20
> 
>  
> 
> [labuser@f5-test ~]$ dig srv _sip._udp.scscf.sprout.lp.labdomain.net
> @10.xx.xx.5
> 
> ;; Truncated, retrying in TCP mode.
> 
>  
> 
> ; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.62.rc1.el6_9.5 <<>> srv
> _sip._udp.scscf.sprout.lp.labdomain.net @10.xx.xx.5
> 
> ;; global options: +cmd
> 
> ;; connection timed out; no servers could be reached
> 
>  
> 
>  
> 
> Dnsmasq logging shows:
> 
>  
> 
> Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: query[SRV]
> _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
> 
> Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: forwarded
> _sip._udp.scscf.sprout.lp.labdomain.net to 10.xx.xx.12
> 
> Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: forwarded
> _sip._udp.scscf.sprout.lp.labdomain.net to 10.xx.xx.20
> 
> Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5161]: nameserver
> 10.xx.xx.20 refused to do a recursive query
> 
> Aug 14 16:22:14 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5172]: query[SRV]
> _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
> 
> Aug 14 16:22:24 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5173]: query[SRV]
> _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
> 
> Aug 14 16:22:34 vsmslp-az2-dev-dnsmasq1-mgt dnsmasq[5174]: query[SRV]
> _sip._udp.scscf.sprout.lp.labdomain.net from 10.xx.xx.99
> 
>  
> 
>  
> 
> I could use some ideas on how to further troubleshoot this issue.
> 
>  
> 
>  
> 
>  
> 
>  
> 
> *Andy Warner*
> 
> Telecom Design Engineer
> 
> O: 406-752-3330 / M: 913-972-7521
> 
> andrew.c.war...@sprint.com
> 
>  
> 
> 
> 
> _______________________________________________
> Dnsmasq-discuss mailing list
> Dnsmasq-discuss@lists.thekelleys.org.uk
> http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
> 

