On Wed, Sep 26, 2012 at 1:25 PM, Jack Neely <[email protected]> wrote:
> After spending some quality time with my logs, I do about 1.3 million
> kerberos requests a day or 960/min on average. The incident that took
> out the kerberos servers with an additional 600 hits/min (from the krb
> logs) doesn't even make a spike on my graphs. My late morning usage is
> higher.

I'm not sure I understood correctly what the incident's symptoms were.
If the symptom was non-responsiveness for a second at a time, then it's
very likely (very, very, very likely) that the patches I mentioned
earlier will solve your problem. The events to correlate the incident
with would be kadmind / kadmin.local / kdb5_util load / kpropd iprop
events: the longer these events run, the more likely it is that the KDC
ends up sleeping for a second at a time.

The bug (if I'm right that it is the bug affecting you) is that all
versions of MIT krb5 from 1.5 through 1.10, inclusive, used non-blocking
file locking in a three-try retry loop with a 1-second sleep on each
pass. This is disastrous, really, but it only bites when something
holds an exclusive lock on the KDB, which would be the daemons/tools
listed above. Since the time spent holding an exclusive lock on the KDB
is generally short (always, if you don't use the kadmin.local lock
command), you might well be getting lucky 99.99% of the time and thus
not observing any outages of one second or more on your KDCs.

If multiple KDCs are affected at roughly the same time then I'd suspect
iprop.

What is the rate of write transactions on your master? Do the rates of
KDC (read) vs. kadm5srv (write) transactions imply the rate of outages
you're experiencing?

Nico
--
________________________________________________
Kerberos mailing list [email protected]
https://mailman.mit.edu/mailman/listinfo/kerberos
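
The retry-and-sleep locking pattern described in the reply above can be
sketched roughly as follows. This is not the MIT krb5 code: it is a
minimal Python illustration that assumes flock()-style advisory locks on
a POSIX system, and the names lock_with_retry, demo, MAX_TRIES and
RETRY_SLEEP are hypothetical. The sleep function is injectable so the
demo can show the stall without actually waiting.

```python
import fcntl
import os
import tempfile
import time

MAX_TRIES = 3        # retry count as described in the message (illustrative)
RETRY_SLEEP = 1.0    # seconds slept after each failed attempt


def lock_with_retry(fd, exclusive=False, sleep=time.sleep):
    """Try a non-blocking flock(); on contention, sleep and try again.

    Returns the number of failed attempts before the lock was obtained.
    Raises BlockingIOError if every attempt fails.
    """
    mode = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    for attempt in range(MAX_TRIES):
        try:
            fcntl.flock(fd, mode | fcntl.LOCK_NB)
            return attempt
        except BlockingIOError:
            # The lock holder may release it a millisecond from now,
            # but this caller still sleeps the full interval.
            sleep(RETRY_SLEEP)
    raise BlockingIOError("lock still held after %d tries" % MAX_TRIES)


def demo():
    """A writer holds an exclusive lock during the reader's first try."""
    writer, path = tempfile.mkstemp()
    reader = os.open(path, os.O_RDWR)   # separate open file description
    slept = []

    def fake_sleep(seconds):
        # Record the stall instead of waiting, and let the writer
        # finish its update while the reader is "asleep".
        slept.append(seconds)
        fcntl.flock(writer, fcntl.LOCK_UN)

    try:
        fcntl.flock(writer, fcntl.LOCK_EX)
        failures = lock_with_retry(reader, sleep=fake_sleep)
        return failures, sum(slept)
    finally:
        os.close(reader)
        os.close(writer)
        os.unlink(path)


if __name__ == "__main__":
    failures, stalled = demo()
    print("failed attempts:", failures, "seconds stalled:", stalled)
```

In this sketch a single collision with a writer costs the reader a full
RETRY_SLEEP interval no matter how briefly the exclusive lock was held,
whereas a blocking lock request would wait only as long as the writer
actually needed.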
