For many reasons, I've been deploying node-local DNS caching for production servers for a while now.
I can highly recommend CoreDNS for this. It should also provide good metrics on the behavior of your central resolvers.

On Thu, Aug 25, 2022, 12:37 PM terrible person <[email protected]> wrote:

> 1) [image: 2022-08-25_20-10-14.png]
>
> 2) I was checking with tcpdump. I don't know if I'm on par with your theory,
> because the client (blackbox) sends a SYN immediately after receiving the
> "large" UDP packet. As I said, I don't see this behavior with dig, nor do I
> see the truncated flag. The UDP response from the server is 860 bytes. My
> hypothesis is that the DNS server is clogged by the number of TCP requests
> (more than 100 hosts) and it resets some of them; then there is a 3s TCP
> timeout, and a successful retry with a new connection after. I will check
> RST flags from port 53 tomorrow on the DNS server host.
>
> 3) Yep, this is something I learned today. I was reading this
> <https://sandilands.info/sgordon/segmentation-offloading-with-wireshark-and-ethtool>
> article, but I don't know, are you sure about it? As I understood it, you see
> incorrect checksums with tcpdump because of this
> [image: 2022-08-25_20-27-13.png]
> but it has no effect on actual traffic. I observed that tcpdump shows
> that checksums are incorrect for outgoing UDP traffic, but the receiver shows
> that checksums are fine. Maybe I can attach some dumps later.
>
> So for now I see some ways to overcome this:
>
> 1) Somehow decrease the DNS response (AUTHORITY SECTION and ADDITIONAL SECTION),
> though I don't know if I can do it (I'm using FreeIPA)
> 2) Make changes on the client side, either custom changes to blackbox itself,
> or architectural changes that spread the probing load on the DNS server.
>
> don't know, hard stuff
>
> On Thursday, August 25, 2022 at 6:40:23 PM UTC+10 Brian Candler wrote:
>
>> What is this "DNSLookupDuration3s" you talk about? Is it an alerting
>> rule? Can you show the expr?
>>
>> To me, it sounds like the opposite problem. My guess is that
>> blackbox_exporter is first making a UDP DNS query, and either the query or
>> the response is being blocked. So after 3 seconds it retries with TCP, and
>> that succeeds.
>>
>> You can check this theory using tcpdump (especially if you can do tcpdump
>> on the caching resolver as well). Do you see an outbound UDP DNS query,
>> but no response? The resolution is to fix the underlying UDP communication
>> problem.
>>
>> Are there any virtual machines involved in this? That's the one case
>> where I have seen this exact problem before with UDP traffic but not TCP.
>> The packet is sent without a correct UDP checksum, because checksum
>> offloading is enabled and the client expects the NIC to insert a correct
>> one; but the receiver doesn't know this, and just sees a packet with a bad
>> checksum and discards it.
>>
>> The solution, or at least workaround, is to disable UDP transmit checksum
>> offloading on the VM's network interface (probably just the one running
>> blackbox_exporter)
>>
>> Try:
>> ethtool --offload eth0 tx off
>>
>> and if that doesn't work, also try:
>> ethtool --offload eth0 gso off gro off tso off
>>
>> On Thursday, 25 August 2022 at 08:34:37 UTC+1 [email protected] wrote:
>>
>>> The blackbox_exporter uses the built-in Go resolver library[0]. The only
>>> options here are which address family you want in return.
>>>
>>> [0]: https://pkg.go.dev/net#Resolver.LookupIP
>>>
>>> On Thu, Aug 25, 2022 at 7:35 AM terrible person <[email protected]>
>>> wrote:
>>>
>>>> Thank you, actually I found out about this behaviour just after I
>>>> posted here.
>>>> Strangely, I don't see TCP connections with either nslookup or dig,
>>>> though the response is about 860 bytes, but UDP outgoing traffic is present.
>>>> When I probe with blackbox there is also TCP.
>>>>
>>>> How does blackbox perform such probes? In parallel or successively? Is
>>>> there a way to suppress such behaviour, analogous to the +notcp option of dig?
>>>>
>>>> On Thursday, August 25, 2022 at 2:03:27 PM UTC+10 [email protected]
>>>> wrote:
>>>>
>>>>> DNS lookups will switch to TCP if the response is larger than can fit
>>>>> in a single packet. But that should happen immediately.
>>>>>
>>>>> On Thu, Aug 25, 2022 at 5:56 AM terrible person <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi. I'm currently debugging DNS Lookup warnings (more than 3 sec) and
>>>>>> need to figure out whether our network, our DNS, or the exporter is
>>>>>> misbehaving.
>>>>>> So I'm checking ssh endpoints with the tcp module:
>>>>>>
>>>>>> [image: 2022-08-25_13-22-13.png]
>>>>>>
>>>>>> but I experience a 3+ second delay on resolving ssh hostnames, which
>>>>>> triggers DNSLookupDuration3s alerts.
>>>>>>
>>>>>> [image: 2022-08-25_13-26-19.png]
>>>>>> The problem looks something like this on different hosts: 3.0s+ seconds
>>>>>> of timeout, which looks very much like a generic TCP timeout.
>>>>>>
>>>>>> I checked on the DNS server and yes, after the UDP queries there is a TCP
>>>>>> DNS query for the A record. I don't see any UDP checksum corruption or
>>>>>> delays for such failover. Is this intended? Can someone help me out on this?
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "Prometheus Users" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit
>>>>>> https://groups.google.com/d/msgid/prometheus-users/16d8f137-a7e3-4361-a624-6719d71b1d29n%40googlegroups.com
>>>>>> <https://groups.google.com/d/msgid/prometheus-users/16d8f137-a7e3-4361-a624-6719d71b1d29n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CABbyFmpGRFmvFWG2_iXj-WkjbXLUiK9SUcOWi2HLVP7erW%3D6Gg%40mail.gmail.com.
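
The UDP-versus-TCP question raised in this thread (an 860-byte UDP response, no truncated flag seen with dig, yet a TCP retry from blackbox) can be checked from the client side with a small script. The following is only a rough sketch under stated assumptions, not what blackbox_exporter or the Go resolver actually does: `build_query`, `is_truncated`, and `probe` are hypothetical helpers written for illustration, and the resolver address and hostname are whatever you pass in. It sends one A query over UDP, optionally advertising an EDNS0 buffer size, and reports the response size, whether the TC (truncated) bit is set, and how long the answer took.

```python
import socket
import struct
import time


def build_query(name, qid=0x1234, edns_size=None):
    """Build a minimal DNS A query; append an EDNS0 OPT record if edns_size is set."""
    arcount = 1 if edns_size else 0
    # Header: ID, flags (only RD set), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, arcount)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    opt = b""
    if edns_size:
        # OPT pseudo-record: root name, TYPE=41, CLASS=UDP payload size,
        # TTL=0 (extended RCODE and flags), RDLENGTH=0
        opt = b"\x00" + struct.pack("!HHIH", 41, edns_size, 0, 0)
    return header + question + opt


def is_truncated(response):
    """Return True if the TC bit (0x0200) is set in the DNS header flags."""
    flags = struct.unpack("!H", response[2:4])[0]
    return bool(flags & 0x0200)


def probe(server, name, edns_size=None, timeout=5.0):
    """Send one UDP query and return (response_bytes, truncated, seconds)."""
    query = build_query(name, edns_size=edns_size)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.sendto(query, (server, 53))
        response, _ = sock.recvfrom(65535)
        elapsed = time.monotonic() - start
    return len(response), is_truncated(response), elapsed
```

Comparing `probe("192.0.2.53", "host.example.com")` with `probe("192.0.2.53", "host.example.com", edns_size=4096)` (both arguments are placeholders) would show whether the resolver sets the TC bit only when no EDNS0 buffer size is advertised. A query without EDNS0 is limited to 512 bytes over UDP, while dig sends EDNS0 by default, which is one reason dig might never see truncation for an 860-byte answer.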

