We've just upgraded our master KDC from 1.4 to 1.8, and are observing that around 2% of all password attempts fail with "Cannot lock database" returned to the user. I'd appreciate any thoughts on how to improve this situation. A rather lengthy discussion follows...
Our setup runs a kdb5_util dump every five minutes for onward kprop propagation to the slaves. The detail of that dump and prop is very similar to that recommended in the 1.8 installation guide, although for compatibility with our 1.6 slaves we are using the '-r13' option to kdb5_util (I'll describe how it differs later, but this doesn't alter the main point).

I should point out that prior to our upgrade we had occasional problems where database updates took longer than the iptables state tracking timeout, which resulted in even worse problems (where the update succeeded but the kadmin/kpasswd client received an error). The new behaviour is definitely preferable: a larger number of errors occur, but the error messages actually match reality. There's still room for improvement, though.

With 1.8, I can see that there is a fixed number of retries defined in src/plugins/kdb/db2/kdb_db2.c (5, 1 second apart), which tallies exactly with our logs (requests coming in 5 seconds or less prior to the end of the dump proceed okay). This is incidentally the same interval/number as in 1.4's krb5_db2_db_put_principal, so I'm not sure why we saw the iptables timeouts based on this analysis. But I digress...

Since our database dumps currently take around 12 seconds, I estimate that if we changed the number of retries from 5 to 15 we'd almost completely eliminate this problem without introducing unacceptable delays... until our database grows again.

Anyhow, I wonder whether we're doing something particularly odd here. We'd obviously like to reduce or completely eliminate users getting this message, but recompiling to change that #define seems wrong. We'd like to move to incremental propagation ultimately, but this would mean moving our slaves to 1.8, which isn't ideal for us at the moment. We have around 55,000 principals and a database size of around 150MB.
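As a rough sanity check on those numbers, here is a back-of-the-envelope sketch in Python. It rests on assumptions that are not verified against the code: that the dump holds the lock for its full duration, that a dump runs in every five-minute cycle, that requests arrive uniformly over time, and that a request effectively waits for (retries x interval) seconds before giving up. The function and parameter names are made up for illustration.

```python
def failure_fraction(dump_seconds, retry_count,
                     retry_interval=1.0, cycle_seconds=300.0):
    """Estimate the fraction of requests that exhaust their lock retries.

    Assumes uniform arrivals, one dump per cycle, and a lock held for the
    whole dump. All of these are simplifications.
    """
    retry_window = retry_count * retry_interval
    # A request succeeds if the dump finishes within retry_window of its
    # arrival, so only arrivals in the first (dump_seconds - retry_window)
    # seconds of the dump end up failing.
    failing_seconds = max(0.0, dump_seconds - retry_window)
    return failing_seconds / cycle_seconds


# 12 second dump every 300 seconds, for a few candidate retry counts.
for retries in (5, 10, 15):
    fraction = failure_fraction(12, retries)
    print(f"{retries:2d} retries: {fraction:.1%} of requests fail")
```

With 5 retries this gives 7/300, or roughly 2.3%, which is in the same ballpark as the failure rate we observe; with 15 retries the estimate drops to zero for a 12 second dump. Because dumps are skipped when the database hasn't changed, the real figure should be somewhat lower than this upper bound.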
Oh, and one final note: another part of the reason this appears to hit us more with 1.8 is that our dump-and-prop is done via a Makefile which only dumps the database if the previous dump file is older than the principal database (via a simple Makefile dependency). With 1.8, it looks like (some?) getprinc requests also end up modifying the principal database mtime. Log correlation suggests that not all getprincs have this effect, and that there is a lag of several seconds, but that's the best idea I've got. I can't immediately spot what in the code is doing this; any ideas?

Thanks for reading!

--
Dominic Hargreaves, Systems Development and Support Team
Computing Services, University of Oxford
