For many reasons, I've been deploying node-local DNS caching on production
servers for a while now.

I can highly recommend CoreDNS for this. It should also give you good
metrics on the behavior of your central resolvers.
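
As a rough illustration only (a minimal sketch, not a config taken from this
thread), a node-local CoreDNS cache could use a Corefile along these lines.
The listen address, the 30-second cap on cached TTLs, and the upstream
192.0.2.53 are placeholder assumptions; substitute your own central resolvers.

    .:53 {
        # assumption: answer only clients on this host
        bind 127.0.0.1
        # assumption: cap cached TTLs at 30 seconds
        cache 30
        # placeholder address standing in for a central resolver
        forward . 192.0.2.53
        # expose CoreDNS metrics for Prometheus to scrape
        prometheus 127.0.0.1:9153
        errors
    }

With something like this in place, the host's resolv.conf would point at
127.0.0.1, and the metrics endpoint reports on the cache and on the queries
forwarded to the upstream resolvers.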

On Thu, Aug 25, 2022, 12:37 PM terrible person <[email protected]>
wrote:

> 1) [image: 2022-08-25_20-10-14.png]
>
> 2) I was checking with tcpdump. I'm not sure this matches your theory,
> because the client (blackbox) sends a SYN immediately after receiving the
> "large" UDP packet. As I said, I don't see this behavior with dig, nor do I
> see the truncated flag. The UDP response from the server is 860 bytes. My
> hypothesis is that the DNS server is getting clogged by the number of TCP
> requests (more than 100 hosts) and resets some of them; then there is a 3 s
> TCP timeout, followed by a successful retry on a new connection. I will
> check for RST flags from port 53 tomorrow on the DNS server host.
>
> 3) Yep, this is something I learned today. I was reading this
> <https://sandilands.info/sgordon/segmentation-offloading-with-wireshark-and-ethtool>
> article,
> but I don't know; are you sure about it? As I understood it, you see
> incorrect checksums in tcpdump because of this
> [image: 2022-08-25_20-27-13.png]
> but it has no effect on the actual traffic. I observed that tcpdump shows
> incorrect checksums for outgoing UDP traffic, but the receiver shows
> that the checksums are fine. Maybe I can attach some dumps later.
>
> So for now I see some ways to overcome this:
>
> 1) Somehow reduce the size of the DNS response (AUTHORITY SECTION and
> ADDITIONAL SECTION), though I don't know if I can do that (I'm using FreeIPA).
> 2) Make changes on the client side: either custom changes to blackbox itself,
> or architectural changes that spread the probing load on the DNS server.
>
> I don't know, hard stuff.
>
> On Thursday, August 25, 2022 at 6:40:23 PM UTC+10 Brian Candler wrote:
>
>> What is this "DNSLookupDuration3s" you talk about?  Is it an alerting
>> rule?  Can you show the expr?
>>
>> To me, it sounds like the opposite problem.  My guess is that
>> blackbox_exporter is first making a UDP DNS query, and either the query or
>> the response is being blocked. So after 3 seconds it retries with TCP, and
>> that succeeds.
>>
>> You can check this theory using tcpdump (especially if you can do tcpdump
>> on the caching resolver as well).  Do you see an outbound UDP DNS query,
>> but no response? The resolution is to fix the underlying UDP communication
>> problem.
>>
>> Are there any virtual machines involved in this?  That's the one case
>> where I have seen this exact problem before with UDP traffic but not TCP.
>> The packet is sent without a correct UDP checksum, because checksum
>> offloading is enabled and the client expects the NIC to insert a correct
>> one; but the receiver doesn't know this, and just sees a packet with a bad
>> checksum and discards it.
>>
>> The solution, or at least workaround, is to disable UDP transmit checksum
>> offloading on the VM's network interface (probably just the one running
>> blackbox_exporter).
>>
>> Try:
>>     ethtool --offload eth0 tx off
>>
>> and if that doesn't work, also try:
>>     ethtool --offload eth0 gso off gro off tso off
>>
>> On Thursday, 25 August 2022 at 08:34:37 UTC+1 [email protected] wrote:
>>
>>> The blackbox_exporter uses the built-in Go resolver library[0]. The only
>>> options here are which address family you want in return.
>>>
>>> [0]: https://pkg.go.dev/net#Resolver.LookupIP
>>>
>>> On Thu, Aug 25, 2022 at 7:35 AM terrible person <[email protected]>
>>> wrote:
>>>
>>>> Thank you, actually I found out about this behaviour just after I
>>>> posted here.
>>>> Strangely, I don't see TCP connections with either nslookup or dig,
>>>> even though the response is about 860 bytes; only outgoing UDP traffic
>>>> is present. When I probe with blackbox there is also TCP.
>>>>
>>>> How does blackbox perform such probes? In parallel or successively? Is
>>>> there a way to suppress this behaviour, analogous to dig's +notcp option?
>>>>
>>>> On Thursday, August 25, 2022 at 2:03:27 PM UTC+10 [email protected]
>>>> wrote:
>>>>
>>>>> DNS lookups will switch to TCP if the response is larger than can fit
>>>>> in a single packet. But that should happen immediately.
>>>>>
>>>>> On Thu, Aug 25, 2022 at 5:56 AM terrible person <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi. I'm currently debugging DNS lookup warnings (more than 3 s) and
>>>>>> need to figure out whether our network, our DNS, or the exporter is
>>>>>> misbehaving.
>>>>>> I'm checking SSH endpoints with the tcp module:
>>>>>>
>>>>>> [image: 2022-08-25_13-22-13.png]
>>>>>>
>>>>>> but I see a delay of 3+ seconds when resolving the SSH hostnames, which
>>>>>> triggers the DNSLookupDuration3s alert.
>>>>>>
>>>>>> [image: 2022-08-25_13-26-19.png]
>>>>>> The problem looks something like this on different hosts: a timeout of
>>>>>> just over 3.0 seconds, which looks very much like a generic TCP timeout.
>>>>>>
>>>>>> I checked on the DNS server and yes, after the UDP queries there is a
>>>>>> TCP DNS query for the A record. I don't see any UDP checksum corruption
>>>>>> or delays for such failover. Is this intended? Can someone help me out
>>>>>> with this?
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "Prometheus Users" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/prometheus-users/16d8f137-a7e3-4361-a624-6719d71b1d29n%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/prometheus-users/16d8f137-a7e3-4361-a624-6719d71b1d29n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>

-- 
You received this message because you are subscribed to the Google Groups 
"Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/prometheus-users/CABbyFmpGRFmvFWG2_iXj-WkjbXLUiK9SUcOWi2HLVP7erW%3D6Gg%40mail.gmail.com.
