I've been testing the crash recovery of REL9_2_BETA1, using the same method I posted in the "Scaling XLog insertion" thread. I have the checkpointer occasionally throw a FATAL error, which causes the postmaster to take down all of the other processes (DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.) and initiate recovery.
However, sometimes the automatic recovery never initiates. It looks like the postmaster is waiting for the archiver to exit before it starts recovery, and the archiver is waiting for something, though I don't know what.

This happens on about 10% of the crashes on REL9_2_BETA1, although I imagine that number is extremely dependent on the minutiae of the setup, hardware, and phase of the moon, as it is probably some kind of race. This behavior is also present in 9_1_STABLE, although at a much lower prevalence (about 1%). In fact, it seems to go back at least to 8.4.0.

If I kill -9 the archiver, then recovery initiates and proceeds as normal.

I don't know the best way to tackle this. Should I stare at the code, use "git bisect" (which is hard to do, because I don't know whether the behavior was ever absent, and because the problem only occurs statistically, so each iteration can take many hours), or try some other method?

Thanks,

Jeff