On 18/06/2015 12:00, Matus UHLAR - fantomas wrote:
> On 17.06.15 22:39, Shawn Zhou wrote:
>> BIND on my resolvers reaches the max open file limit and I am getting
>> lots of SERVFAILs
>> http://pastebin.com/SxRsHLff
>
>> After I increased the max-socks (-s 8192) to 8192, I no longer saw the
>> file limit error from the log anymore; however, I am still many SERVFAILs.
>
> no other errors?
>
>> Our resolvers were doing about 15k queries per seconds when this was
>> happening and those were legit traffic. I am aware that I am setting
>> recursive clients to a very high number. Those resolvers are running on
>> 12-cores cpu and 24G RAM hardware. cpu utilization was at about 20% and
>> plenty of RAM left.
>
>> I am wondering if I've reached the limit of BIND for the amount of
>> recursive queries it can serve. Any other tunings I should try?
>
> maybe changing number of recursive-clients, max-clients-per-query.
>
> Does EDNS work for you? EDNS problems often result to increased number of
> TCP queries which slows down resolution ...
>
>> By the way, the resolvers are running RHEL 6.x.
>
> precise BIND version would help a bit more... seems RH6.6 contains 9.8.2
> but that may be different for older RH6 versions.

Unless you're running a build with --with-tuning=large (for which there
are a number of caveats around the capacity of the machine, etc.), you
don't really want a backlog of recursive clients that exceeds 3000-3500.
If you're getting that many in your backlog then, as already highlighted
to you, there is Something Wrong going on.

You're probably running into other resource limits, and those are what
is causing the SERVFAIL responses you're still seeing despite increasing
the maximum number of sockets that named can use.

I would tune the recursive-clients limit down to 3000 and allow named to
drop the oldest outstanding client queries when new ones need to be
processed.

There is another logging category you can use (query-errors) that can
tell you more, but it's probably not worth it in this instance.

I also have another suggestion for what might be causing your backlog
(apart from problems in the network path between your servers and the
Internet authoritative servers), for which we have some
soon-to-be-released new mitigation features (in 9.10.3):

https://kb.isc.org/article/AA-01178

(This will be updated to reflect the features we will actually include
in the upcoming release, but they're essentially going to be
fetches-per-server and fetches-per-zone, along with improved
logging/stats for both of those.)

There's going to be a webinar about both the problem and the mitigations
on July 8th:

https://www.facebook.com/events/100311766979499/
http://goo.gl/Z8idQf

Hope this is useful.

Cathy
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to
unsubscribe from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users
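
For reference, the tuning suggested in the reply above (a recursive-clients limit of 3000, plus the optional query-errors logging category) might look roughly like the following in named.conf. This is only a sketch under stated assumptions: the channel name, log file path, and rotation settings are hypothetical, and the commented-out fetch limits use illustrative values that do not come from the message. Those two options are only expected from 9.10.3 onward, so they would not apply to the 9.8.2 build mentioned in the thread.

```
options {
    // Cap the backlog of outstanding recursive clients, as suggested above.
    recursive-clients 3000;

    // Expected in 9.10.3 and later, per the message above.
    // The values below are purely illustrative, not recommendations.
    // fetches-per-server 400;
    // fetches-per-zone 200;
};

logging {
    // Hypothetical channel name and file path; adjust to the local layout.
    channel query_errors_log {
        file "/var/named/data/query-errors.log" versions 3 size 20m;
        // "dynamic" follows named's current debug level, so detail may
        // only appear once debugging is enabled (for example, rndc trace).
        severity dynamic;
        print-time yes;
    };
    category query-errors { query_errors_log; };
};
```

Running named-checkconf against the file before reloading should confirm whether the installed BIND version accepts each statement.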