*TL;DR*: Kubernetes dnsPolicy: ClusterFirst can become a bottleneck at a high rate of outbound connections. The problem seems to be that the nf_conntrack table on the node running kube-dns fills up, causing client applications to fail DNS lookups. I resolved this by switching my application to dnsPolicy: Default, which gave much better performance since this application does not need cluster DNS.
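For concreteness, the change amounts to setting dnsPolicy on the pod spec. A minimal sketch (the pod and image names here are placeholders, not my actual manifest):

apiVersion: v1
kind: Pod
metadata:
  name: load-generator             # placeholder name
spec:
  # "Default" makes the pod inherit the node's /etc/resolv.conf instead of
  # pointing at the cluster kube-dns Service, so it is only suitable for pods
  # that never need to resolve cluster-internal Service names.
  dnsPolicy: Default
  containers:
  - name: load-generator
    image: example/load-generator  # placeholder image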
It seems like this is probably a "known" problem (see the issues below), but I can't tell: is there a solution being worked on for this? Thanks!

*Details*: We were running a load generator and were surprised to find that the aggregate request rate did not increase as we added more instances and nodes to our cluster (GKE 1.7.6-gke.1). Eventually the application started getting errors like "Name or service not known" at surprisingly low rates, around 1000 requests/second. Switching the application to dnsPolicy: Default resolved the issue.

I spent some time digging into this, and the problem is not the CPU utilization of kube-dns / dnsmasq itself. On my small cluster of ~10 n1-standard-1 instances, I can get about 80000 cached DNS queries/second. I *think* the issue is that when enough machines are talking to this single DNS server, the nf_conntrack table fills up and packets get dropped, which I believe ends up rate limiting the clients. dmesg on the node running kube-dns shows a constant stream of:

[1124553.016331] nf_conntrack: table full, dropping packet
[1124553.021680] nf_conntrack: table full, dropping packet
[1124553.027024] nf_conntrack: table full, dropping packet
[1124553.032807] nf_conntrack: table full, dropping packet

It seems to me that this is a bottleneck for Kubernetes clusters, since by default all DNS queries are directed to a small number of machines, which then fill up their connection tracking tables. Is there a planned solution to this bottleneck? I was very surprised that *DNS* would be my bottleneck on a Kubernetes cluster, and at such low rates.

*Related GitHub issues*: The following GitHub issues may be related to this problem. They all have a fair amount of discussion but no clear resolution:

Run kube-dns on each node: https://github.com/kubernetes/kubernetes/issues/45363
Run dnsmasq on each node; mentions conntrack: https://github.com/kubernetes/kubernetes/issues/32749
kube-dns should be a daemonset / run on each node: https://github.com/kubernetes/kubernetes/issues/26707
dnsmasq intermittent connection refused: https://github.com/kubernetes/kubernetes/issues/45976
Intermittent DNS to external name: https://github.com/kubernetes/kubernetes/issues/47142
kube-aws seems to already run a local DNS resolver on each node: https://github.com/kubernetes-incubator/kube-aws/pull/792/
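One more note for anyone who wants to check whether they are hitting the same limit: the conntrack counters on the node running kube-dns should show the table pinned at its maximum (a quick sketch, assuming the standard netfilter sysctls are exposed on the node):

# On the node running kube-dns / dnsmasq:
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# Equivalent via /proc:
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

When nf_conntrack_count sits at nf_conntrack_max, dmesg shows the "table full, dropping packet" lines quoted above.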