Maybe this does reveal something about the caching, which might be
expected behaviour, but I am not convinced it's useful...

Overnight monitoring has shown that the upstream server does occasionally
send back an incomplete (but perfectly valid) CNAME-only response.  Mostly
I can justify the caching behaviour based on the TTLs of the second CNAME
or A record (the server is authoritative for the first CNAME, so that's
always at 3600).

As a slight aside:
dnsmasq sends a query at 22:57:32.599, then again (new transaction id) at
22:57:33.601, and at 22:57:36.601.
This last query gets a response in 0.1 seconds; both of the others eventually
come in (incomplete) at 22:57:44.073.
I am assuming that dnsmasq ignored these late arrivals (either due to a
default timeout, or just because a better answer had already been received;
this would be comparable with its behaviour when it queries multiple servers
to decide which is 'best').
In this case we are protected by the fact that the incomplete response takes
far longer than the complete one due to timeouts.
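
To make the timing in that aside explicit, here is a rough Python sketch of
the arithmetic. It only uses the timestamps quoted above; the helper name
seconds_between and the variable names are mine, not anything from dnsmasq
or the capture tooling.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"  # same-day timestamps with millisecond precision


def seconds_between(start, end):
    """Return end - start in seconds for two HH:MM:SS.mmm stamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return round(delta.total_seconds(), 3)


# The three outbound queries for the name, as seen in the capture.
queries = ["22:57:32.599", "22:57:33.601", "22:57:36.601"]
# Both incomplete responses to the first two queries arrived at this time.
late_arrival = "22:57:44.073"

# Gap between each query and the next retry.
retry_gaps = [seconds_between(a, b) for a, b in zip(queries, queries[1:])]
# How long each of the first two queries waited for its (incomplete) answer.
late_latencies = [seconds_between(q, late_arrival) for q in queries[:2]]

print("retry gaps (s):", retry_gaps)
print("late response latencies (s):", late_latencies)
```

So the retries went out roughly 1 and 3 seconds apart, and the incomplete
answers took more than 10 seconds each, against 0.1 seconds for the
complete one.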

Later though:
At 01:12:47 the TTL has expired, so dnsmasq sends a request and gets an
incomplete response. The response only contains the first CNAME, which has
a TTL of 3600.

Then dnsmasq doesn't send another query for an hour - despite the fact that
it doesn't have a "good" answer.
In this case the query it sends after an hour gets an incomplete response
again - not good.
Then I lost track because the container got moved to a different host - but
it looks like it was returning incomplete responses for several hours...
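
The pattern above is what a plain TTL cache would produce. The following is
a toy model in Python, explicitly NOT dnsmasq's implementation: ToyCache and
flaky_upstream are made-up names, and the upstream is assumed to return the
CNAME-only answer every time. It just shows that one lookup per TTL period
reaches the upstream, however many client queries arrive.

```python
class ToyCache:
    """Toy TTL cache; an illustration only, not how dnsmasq is written."""

    def __init__(self, upstream):
        self.upstream = upstream   # callable(name, now) -> (answer, ttl)
        self.entries = {}          # name -> (answer, expiry time)
        self.upstream_queries = 0

    def lookup(self, name, now):
        cached = self.entries.get(name)
        if cached and now < cached[1]:
            # Served from cache whether or not the answer was complete.
            return cached[0]
        self.upstream_queries += 1
        answer, ttl = self.upstream(name, now)
        self.entries[name] = (answer, now + ttl)
        return answer


# The incomplete answer: only the first CNAME, no second CNAME or A record.
INCOMPLETE = [("foo.bar.com", "CNAME", "foo.bar.com.edgekey.net")]


def flaky_upstream(name, now):
    # Hypothetical upstream that always sends the CNAME-only answer, TTL 3600.
    return INCOMPLETE, 3600


cache = ToyCache(flaky_upstream)
# One client query per minute for two hours.
answers = [cache.lookup("foo.bar.com", t) for t in range(0, 7200, 60)]
print(len(answers), "lookups,", cache.upstream_queries, "upstream queries")
```

In this model nothing distinguishes a "good" cached answer from an
incomplete one, so the incomplete answer is served for the full hour.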

dnsmasq is otherwise well behaved: it is still responding to other queries
just fine, despite being hammered by more than 2k queries/second.

Two questions:
 - Is it correct/wanted behaviour to cache an incomplete record like this?
I have no issue caching the CNAME, but should we keep trying to resolve the
CNAME to an A record?

 - Why/How does a restart of the querying program change the caching
behaviour of dnsmasq?
Because even if the program is restarted after just a few minutes it
immediately gets better data - my capture from yesterday shows that despite
the fact that the TTL had 2855 seconds (of the 3600 default) left just two
minutes before the first 'new process' request comes in, that new request
triggers an outbound query.
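
For the first question, "incomplete" can be checked mechanically: follow the
CNAMEs in the answer and see whether an A record is ever reached. Below is a
minimal Python sketch of that check. The function name, the tuple layout
(owner, type, data) and the max_hops limit are my assumptions, and
192.0.2.10 is a documentation placeholder rather than the real address.

```python
def resolves_to_address(qname, records, max_hops=8):
    """Follow CNAMEs from qname; return True if an A record is reached."""
    by_owner = {}
    for owner, rtype, rdata in records:
        key = owner.lower().rstrip(".")
        by_owner.setdefault(key, []).append((rtype, rdata))

    name = qname.lower().rstrip(".")
    for _hop in range(max_hops):
        rrset = by_owner.get(name, [])
        if any(rtype == "A" for rtype, _rdata in rrset):
            return True
        cnames = [rdata for rtype, rdata in rrset if rtype == "CNAME"]
        if not cnames:
            return False          # chain ends without an address
        name = cnames[0].lower().rstrip(".")
    return False                  # too many hops, treat as unresolved


complete = [
    ("foo.bar.com", "CNAME", "foo.bar.com.edgekey.net"),
    ("foo.bar.com.edgekey.net", "CNAME", "e1234.a.akamaiedge.net"),
    ("e1234.a.akamaiedge.net", "A", "192.0.2.10"),  # placeholder address
]
incomplete = complete[:1]  # only the first CNAME, as in the bad responses

print(resolves_to_address("foo.bar.com", complete))
print(resolves_to_address("foo.bar.com", incomplete))
```

A check like this could equally live in the querying program, which is
where the non-IP answer currently causes trouble.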



On Wed, 20 Mar 2019 at 23:44, John Robson <jrob...@zenoss.com> wrote:

> It is the idea of caching, but not beyond the record TTL surely? And why
> stop only when I reset another piece of software (whether I do that after 5
> minutes or 4 hours).
> I'm finding that the upstream server is inconsistent in how much
> information it returns - just occasionally not returning anything beyond
> the first CNAME - which means that this is probably passed on to my program
> as such, which means that something else is involved in triggering it...
> I don't expect this to be easy :(
> I think we may have found the application bug (it just doesn't know how to
> handle a non IP address return), but I'd still like to understand the
> behaviour from dnsmasq.
> On Wed, 20 Mar 2019 at 23:30, Geert Stappers <stapp...@stappers.nl> wrote:
>> On Wed, Mar 20, 2019 at 09:00:20PM +0000, John Robson wrote:
>> > Hi,
>> >
>> > I have a library which I think has a bug, but this bug is affecting DNS
>> > queries, and bringing out some odd behaviour in dnsmasq...
>> >
>> > Program is making a query to resolve an address (foo.bar.com)
>> > A normal query results in a CNAME (foo.bar.com.edgekey.net), which
>> > results in another CNAME (e1234.a.akamaiedge.net) which has an A record.
>> >
>> > However every so often dnsmasq returns just the first CNAME.
>> > Note I haven't yet caught it in the act of that first truncated
>> > response.
>> > The only thing that makes sense to me is if the edgekey.net name
>> > servers didn't respond in good time... but....
>> >
>> > However the bug in the library then means it asks again, instantly. And
>> > again... and again....
>> > It manages over 100MB/minute of DNS requests - dnsmasq answering them
>> > all from the cache (I see *no* external requests for that address).
>> Hey, that is the idea about DNS caching ...
>> > When I restart the program the very first query (identical query as
>> > before) gets a complete answer from dnsmasq.
>> >
>> > What I can't understand is how that restart makes any difference to
>> > dnsmasq.
>> > Does dnsmasq have some sort of 'Oh hell the query load is insane I'm
>> > just extending the cache a bit to help' mode which it then escapes from
>> > as the program restarts?
>> > There are no external queries for this name during the period of
>> > insanity, but the first request after does get put to the external name
>> > servers.
>> >
>> > I'm running an 'external interface only' capture to try and capture the
>> > initial error condition (which I very much doubt is a problem in
>> > dnsmasq), to see if that can shed some light on the issue.
>> >
>> >
>> > Thoughts? debug hints? laughter?
>> To me it seems that the first DNS request from the application does
>> "recursion", and that upon encountering the bug the app does "non
>> recursion". By "recursion" I mean 'when the reply is not an A record,
>> do a next query'.
>> On debug hints:  currently the suspected trigger of the bug is
>> a DNS server that doesn't respond in good time.  So make a "chain"
>> of DNS servers where you control the response time of one.
>> Good luck with it.  And feel welcome to report back.
>> > Cheers,
>> > John
>> Groeten
>> Geert Stappers
>> --
>> Leven en laten leven
>> _______________________________________________
>> Dnsmasq-discuss mailing list
>> Dnsmasq-discuss@lists.thekelleys.org.uk
>> http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss
> --
> *John Robson*


*John Robson*