Hans-Juergen Schoenig wrote:
Last week we saw a problem with a horribly configured machine.
The disk filled up (bad FSM ;) ) and once this happened the sysadmin killed the system (-9). After two days PostgreSQL had still not started up, and they tried to restart it again and again, which meant the consistency check was started over and over again (thus causing more and more downtime). From the admin's point of view there was no way to find out whether the machine was actually dead or still recovering.

Here is a small patch which issues a log message indicating that the recovery process can take ages.
Maybe this can prevent some admins from interrupting the recovery process.

Wait, are you saying that the time was spent in the rm_cleanup phase? That sounds unbelievable. Surely the time was spent in the redo phase, no?

In our case, the recovery process took 3.5 days!!

That's a ridiculously long time. Was this a normal recovery, not a PITR archive recovery? Any idea why the recovery took so long? Given the maximum checkpoint timeout of 1h, I would expect the recovery to take at most a few hours, even with an extremely write-heavy workload.

  Heikki Linnakangas
