Hi Mark,
I may have found another (possibly related?) bug:
I noticed that when validating a signed zone with delv against a local
BIND caching server (v9.10.3-P4), delv sometimes suddenly reports "no valid
RRSIG". Indeed, "dig ds mydomain +dnssec" then returns the DS records but
no RRSIG at all. The following sequence of commands (output simplified)
makes me think this might be related to prefetch/cache expiry as well
(prefetch value 2):
$ while true; do dig ds mydomain; sleep 1; done
;; ANSWER SECTION:
mydomain. 3 IN DS […]
mydomain. 3 IN DS […]
mydomain. 3 IN RRSIG DS […]
;; ANSWER SECTION:
mydomain. 3600 IN DS […]
mydomain. 3600 IN DS […]
mydomain. 2 IN RRSIG DS […]
;; ANSWER SECTION:
mydomain. 3599 IN DS […]
mydomain. 3599 IN DS […]
mydomain. 1 IN RRSIG DS […]
;; ANSWER SECTION:
mydomain. 3598 IN DS […]
mydomain. 3598 IN DS […]
mydomain. 0 IN RRSIG DS […]
;; ANSWER SECTION:
mydomain. 3597 IN DS […]
mydomain. 3597 IN DS […]
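In case it helps with reproducing, here is a minimal sketch of how the affected answers could be flagged automatically instead of by eye. The function name check_ds_rrsig is hypothetical (not part of BIND or dig), and the sketch assumes dig's default presentation format, where the record type is the fourth field and an RRSIG's covered type is the fifth:

```shell
# check_ds_rrsig (hypothetical helper): reads dig output on stdin and prints
# one line per ANSWER SECTION: "OK", or "MISSING_RRSIG" when DS records are
# present without a covering RRSIG DS record.
check_ds_rrsig() {
  awk '
    function report() {
      if (!seen) return
      if (ds > 0 && sig == 0) {
        print "MISSING_RRSIG"
      } else {
        print "OK"
      }
    }
    # A new answer section starts: report on the previous one, reset counters.
    /^;; ANSWER SECTION:/ { report(); seen = 1; inans = 1; ds = 0; sig = 0; next }
    # Any other comment line or blank line ends the answer section.
    /^;/ || /^[ \t]*$/    { inans = 0; next }
    inans && $4 == "DS"                  { ds++ }
    inans && $4 == "RRSIG" && $5 == "DS" { sig++ }
    END { report() }
  '
}

# Possible usage (needs a reachable resolver, so it is left commented out):
#   while true; do dig ds mydomain +dnssec | check_ds_rrsig; sleep 1; done
```

Fed the simplified output above, this would print OK for the first four answers and MISSING_RRSIG for the last one.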
What’s your take on this?
Regards,
Thomas
> On 20.06.2016, at 08:39, Mark Andrews wrote:
>
>
> A fix for this is in review and should be in the next maintenance
> release.
>
> Mark
>
> In message <16a2cdfd-694d-444a-a760-17c9d7517...@open.ch>, Thomas Sturm
> writes:
>>
>> I am now able to reliably reproduce the behaviour with dig querying BIND
>> 9.10.4-P1 (not 9.9, apparently) with "prefetch 0":
>>
>> $ while true; do dig outlook.office365.com +noauthority +noadditional
>> +tries=1 +retry=0; sleep 0.1; done
>>
>> Wait for 5 minutes; once the TTL expires, this should show about 5-7
>> SERVFAIL responses.
>>
>> prefetch 1 or 2 makes it harder to reproduce and it only happens
>> (sometimes) on loaded systems. prefetch 10 makes it go away.
>>
>> It never happens after restarting or flushing the cache. And it never
>> happens when querying x seconds _after_ the TTL expired. Could there be
>> an issue processing cached client requests during cache expiry, and since
>> it only happens on 9.10, potentially related to prefetching?
>>
>>
>>
>>> On 16.06.2016, at 10:00, Thomas Sturm wrote:
>>>
>>> Hi,
>>>
>>> We are experiencing strange intermittent issues when resolving
>>> outlook.office365.com, but also with other domains such as
>>> amazonaws.com or snort.org. Let’s take office365.com as the example
>>> for now. outlook.office365.com is a CNAME to lb.geo.office365.com, and
>>> office365.com delegates the geo subdomain to different nameservers; 2 of
>>> them show some issues on intodns.com [1] (which may or may not be
>>> related to this problem).
>>>
>>> When querying one of the office365.com nameservers, it correctly
>>> delegates, as far as I understand:
>>>
>>> # dig a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
>>>
>>> ; <<>> DiG 9.10.4 <<>> a lb.geo.office365.com @ns1.msft.net +noadditional +nostats
>>> ;; global options: +cmd
>>> ;; Got answer:
>>> ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 37098
>>> ;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 5
>>> ;; WARNING: recursion requested but not available
>>>
>>> ;; OPT PSEUDOSECTION:
>>> ; EDNS: version: 0, flags:; udp: 4000
>>> ;; QUESTION SECTION:
>>> ;lb.geo.office365.com. IN A
>>>
>>> ;; AUTHORITY SECTION:
>>> geo.office365.com. 300 IN NS glb1.glbdns2.microsoft.com.
>>> geo.office365.com. 300 IN NS ns1.p21.dynect.net.
>>> geo.office365.com. 300 IN NS ns3.p21.dynect.net.
>>> geo.office365.com. 300 IN NS ns4.p21.dynect.net.
>>> geo.office365.com. 300 IN NS ns2.p21.dynect.net.
>>> geo.office365.com. 300 IN NS glb2.glbdns2.microsoft.com.
>>>
>>> Still, BIND (sometimes) decides to return SERVFAIL to the client
>>> immediately after receiving this response. Some interesting debug log
>>> lines:
>>>
>>> resolver: debug 3: resquery 0x7f26fecc8010 (fctx 0x7f26fecb4458(lb.geo.office365.com/A)): sent
>>> resolver: debug 3: resquery 0x7f26fecc8010 (fctx 0x7f26fecb4458(lb.geo.office365.com/A)): response
>>> resolver: debug 10: received packet:
>>> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): noanswer_response
>>> resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: noanswer_response: lb.geo.office365.com (in 'office365.com'?): 1 172499
>>> resolver: debug 10: log_ns_ttl: fctx 0x7f26fecb4458: DELEGATION: lb.geo.office365.com (in 'geo.office365.com'?): 0 172499
>>> resolver: debug 3: fctx 0x7f26fecb4458(lb.geo.office365.com/A): cache_message
>>> resolver: debug 3: fctx