At Mon, 04 Aug 2008 09:24:51 -0500, Walter Gould <[EMAIL PROTECTED]> wrote:
> > If so, and if servers always return SERVFAILs to any query, it may
> > indicate a different type of problem than merely exhausting file
> > descriptors. If you also serve an authoritative zone in that server,
> > you may want to check whether queries for names in the authoritative
> > zone are responded.

> When the servfail errors occur, I want to *think* that we are able to
> still resolve names for our authoritative zone. However - I will need to
> test this again to be sure though. When the servfail errors happen - we
> definitely cannot resolve queries for names we are not authoritative for.
>
> Also, when this happens, I notice that the output from "lsof | grep
> named | wc -l" jumps from around 40 to ~1000. Do you believe this is
> related to the errors - or just a coincidence that this number rises
> when we are having problems resolving external names?

I think these 1000 descriptors are related to the trouble you're
seeing, but I have no idea how exactly they caused the problem. One
thing that looks strange to me is that the server seems to have only
about 1000 sockets even with the larger ISC_SOCKET_FDSETSIZE (but this
may be because the server simply needed that number of sockets).
Another thing that looks strange to me is that the server reportedly
keeps returning SERVFAILs while holding a large number of sockets even
though the CPU load is not high.

I guess we need more information to diagnose:

- your detailed configuration (named.conf)
- output of initial log message (output of named -g before it starts
  accepting queries)
- output of 'rndc status' while the trouble is happening
- output of 'rndc recursing' while the trouble is happening

> > You may also want to check whether the server
> > returns a query for "version.bind TXT CH" (after configuring the
> > server to respond to it).
> >
> What would this tell or buy me? Currently, the version is removed from
> our named.conf file.

It will tell you whether the problem is cache-specific or about the
server as a whole. If you manage an authoritative zone in the same
server, you can do the same test with it.

---
JINMEI, Tatuya
Internet Systems Consortium, Inc.
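
For reference, the "version.bind TXT CH" test mentioned above could be
scripted roughly as follows. This is only a minimal sketch, not a
replacement for dig: it assumes the server is reachable over IPv4 UDP
on port 53 and has been configured to answer this query, and the
function names (build_query, response_rcode, probe) are made up for
illustration. It builds the query by hand and reports only the
response code, which is the part that separates "the server as a
whole is failing" from "only recursion/cache is failing".

```python
import socket
import struct

# Numeric values from the DNS message format (RFC 1035).
QTYPE_TXT = 16
QCLASS_CH = 3  # CHAOS class, the class used for "version.bind"
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
               3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_query(name, qtype=QTYPE_TXT, qclass=QCLASS_CH, query_id=0x1234):
    """Build a minimal DNS query packet with a single question."""
    # Header: ID, flags (all clear), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, ended by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, qclass)


def response_rcode(packet):
    """Return the RCODE name found in a DNS response header."""
    if len(packet) < 12:
        raise ValueError("packet shorter than a DNS header")
    flags = struct.unpack("!H", packet[2:4])[0]
    rcode = flags & 0x000F  # RCODE is the low 4 bits of the flags word
    return RCODE_NAMES.get(rcode, str(rcode))


def probe(server, name="version.bind", timeout=3.0):
    """Send the query to an IPv4 server address and return the RCODE name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        data, _ = sock.recvfrom(4096)
    return response_rcode(data)
```

Calling probe("127.0.0.1") on the affected host while the trouble is
happening, and again when the server is healthy, gives two results to
compare. A timeout (socket.timeout) is also a useful data point, since
it suggests the server is not answering at all.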
