After doing some more packet captures, it looks like a lot of the queries are 
related to Sophos live protection DNS lookups (lots of queries for names under 
Sophos domains), so there are a lot of queries which don't get resolved. We see 
multiple queries for the same name and the resolver seems to retransmit to each 
forwarder when it doesn't get a response, including the non-local ones. So the 
behaviour may be exacerbated by these non-resolvable queries. After about 10 
seconds, the forwarder replies with a SERVFAIL response as it gives up trying 
to get a response from the Sophos name servers.
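
As a rough back-of-the-envelope check on why these unanswered lookups hurt so 
much: if each one holds a recursive client slot for roughly the 10 seconds it 
takes to get the SERVFAIL, the number of slots in use is about the arrival rate 
times 10 (Little's law). The sketch below is only illustrative. The query rates 
are made-up values, not figures from the captures, and the function name is 
hypothetical.

```python
# Rough estimate of recursive client slots held by unanswerable queries.
# Assumption: every such query occupies a slot until the SERVFAIL arrives.
# The arrival rates in the loop are illustrative, not measured values.

def outstanding_clients(queries_per_second: float, hold_seconds: float) -> float:
    """Little's law: average number in flight = arrival rate * time held."""
    return queries_per_second * hold_seconds


HOLD_SECONDS = 10  # approximate time before SERVFAIL, as seen in the captures

for qps in (50, 100, 500, 2000):
    slots = outstanding_clients(qps, HOLD_SECONDS)
    print(f"{qps:>5} unanswerable queries/s -> ~{slots:,.0f} recursive clients held")
```

With BIND's default recursive-clients limit of 1000, that would mean around 100 
such queries per second is enough to fill the table, assuming nothing else is 
recursing at the time.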

So now I am not sure that the RTT algorithm is completely at fault here, as 
BIND is simply trying additional forwarders in an attempt to resolve the name.

I have seen this live protection stuff going on in quite a few corporates now, 
and each time we have had to raise the recursive-clients limit. I don't think 
it's just Sophos that do this; I'm pretty sure I saw this with McAfee a couple 
of years ago too. They seem to use DNS to transmit file name hashes so they can 
do a reputation lookup, but the Sophos servers only reply if some kind of 
action is required. There must be many corporates out there that are 
experiencing issues with the way this works, i.e. all of a sudden their 
resolvers stop recursing because the recursive client limit is hit.
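
For anyone hitting the same thing, the limit in question is the 
recursive-clients option in named.conf. A minimal fragment is below. The value 
shown is only a placeholder for illustration, not a recommendation, and each 
extra client costs some memory, so size it against what the box can take.

```
options {
    // Default is 1000. The value below is an illustrative placeholder only.
    recursive-clients 30000;
};
```

"rndc status" includes a "recursive clients" line, which is a quick way to see 
how close a resolver is getting to the limit during a scan.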

On one account I am working on, the resolvers regularly hit 20,000+ recursive 
clients when they kick off a scheduled virus scan. I wish the anti-virus 
vendors would consider the impact they are having on corporate DNS environments 
and re-think how they implement their reputation lookups; it must be the cause 
of some pretty serious outages. :-(



