On Wed, Sep 26, 2012 at 04:04:25PM -0500, Nico Williams wrote:
> On Wed, Sep 26, 2012 at 1:25 PM, Jack Neely <[email protected]> wrote:
> > After spending some quality time with my logs, I do about 1.3 million
> > kerberos requests a day or 960/min on average. The incident that took
> > out the kerberos servers with an additional 600 hits/min (from the krb
> > logs) doesn't even make a spike on my graphs. My late morning usage is
> > higher.
>
> I'm not sure I understood correctly what the incident's symptoms were.
> If the symptom was non-responsiveness for a second then it's very
> likely (very, very, very likely) that the patches I mentioned earlier
> will solve your problem, and the events to correlate the incident to
> would be kadmind / kadmin.local / kdb5_util load / kpropd iprop events
> -- the longer these events the more likely that the kdc ends up
> sleeping for a second at a time.

This definitely seems to explain the lag in responses I've noticed
during a kprop operation. Usually I get a response in under a second,
but if I hit my KDC while it's receiving a kprop it can take 4 or 5
seconds.

The above incident is a single misbehaving client suddenly doing about
600 requests / minute for around 30 minutes. During this window no one
else could get a KDC response before the client timed out.

I've also noticed that the 1.6.1 version in RHEL 5 is leaking memory. I
think I've found my smoking gun here. Large memory consumption is
directly related to slower performance in my testing.

Thanks a bunch for the pointer to the patch!

Jack Neely

> The bug -if I'm right that it is the bug affecting you- is that
> between 1.5 and 1.10, inclusive, all versions of MIT krb5 used
> non-blocking file locking with a three-re-try loop with a 1-second
> sleep each go around. This is disastrous, really, but it only bites
> when something holds an exclusive lock on the KDB, which would be the
> daemons/tools listed above, and since the amount of time spent holding
> an exclusive lock on the KDB is generally (always, if you don't use
> the kadmin.local lock command) short, you might well be getting lucky
> 99.99% of the time and thus not observing any 1- or more second
> outages on your KDCs.
>
> If multiple KDCs are affected at roughly the same time then I'd
> suspect iprop. What is the rate of write transactions on your master?
> Do the rates of KDC (read) vs. kadm5srv (write) transactions imply
> the rate of outages you're experiencing?
>
> Nico
> --

-- 
Jack Neely <[email protected]>
Linux Czar, OIT Campus Linux Services
Office of Information Technology, NC State University
GPG Fingerprint: 1917 5AC1 E828 9337 7AA4 EA6B 213B 765F 3B6A 5B89
________________________________________________
Kerberos mailing list [email protected]
https://mailman.mit.edu/mailman/listinfo/kerberos
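
For anyone reading this in the archive, the locking pattern Nico
describes (a non-blocking lock attempt, retried three times with a
1-second sleep between tries) can be sketched roughly as below. This is
only an illustration, not the MIT krb5 source: the function name
lock_with_retries and its parameters are made up for the example, and it
uses Python's fcntl.flock() on a plain file descriptor as a stand-in for
whatever locking call the KDB code really makes.

```python
import fcntl
import time


def lock_with_retries(fd, operation, tries=3, delay=1.0, sleep=time.sleep):
    """Hypothetical sketch of a retrying non-blocking file lock.

    Attempts a non-blocking flock() up to `tries` times and sleeps
    `delay` seconds after each failed attempt except the last.
    Returns the number of sleeps that were needed, or raises
    BlockingIOError if the lock was never obtained.
    """
    for attempt in range(tries):
        try:
            # LOCK_NB makes flock() fail immediately instead of waiting.
            fcntl.flock(fd, operation | fcntl.LOCK_NB)
            return attempt
        except BlockingIOError:
            if attempt == tries - 1:
                raise
            # Sleeps the full delay even if the writer releases the
            # lock a few milliseconds later.
            sleep(delay)
```

The point of the sketch is the sleep: a reader that loses the first
attempt waits the whole second no matter how briefly the writer held
the exclusive lock, which fits the one-second stalls described above.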
