Here is the autovacuum patch I am currently working with. This is
basically the same as the previous patch; I have tweaked the database
list management so that after a change in databases (say a new database
is created or a database is dropped), the list is recomputed to account
for the change, keeping the ordering of the previous list.

Modulo two low-probability failure scenarios, I feel this patch is ready
to be applied; I will do so on Friday unless there are objections.

The failure scenarios are detailed in the comment pasted below. I
intend to attack these problems next, but as the first one should be
fairly low probability, I don't think it should bar the current patch
from being applied. (The second problem, which seems to me to be the
more serious one, should be easily fixable by checking launch times and
"aborting" processes that took longer than autovacuum_naptime to start.)

* Main loop for the autovacuum launcher process.
* The signalling between launcher and worker is as follows:
* When the worker has finished starting up, it stores its PID in wi_workerpid
* and sends a SIGUSR1 signal to the launcher. The launcher then knows that
* the postmaster is ready to start a new worker. We do it this way because
* otherwise we risk calling SendPostmasterSignal() when the postmaster hasn't
* yet processed the last one, in which case the second signal would be lost.
* This is only useful when two workers need to be started close to one
* another, which should be rare but is possible.
* Additionally, when the worker is finished with the vacuum work, it sets the
* wi_finished flag and sends a SIGUSR1 signal to the launcher. Upon receipt
* of this signal, the launcher then clears the entry for future use and may
* start another worker right away, if need be.
* There is at least one race condition here. Suppose the workers are all busy
* and a database needs immediate attention. If a worker finishes just after
* the launcher has started a worker and sent the signal to the postmaster,
* but before the postmaster has processed that signal, then the launcher
* receives a signal from the finishing process, sees the empty slot, and
* signals the postmaster again to start another worker. But the flag set by
* SendPostmasterSignal() was already set, so the second signal is lost. To
* avoid this problem, the launcher should not try to start a new worker until
* all WorkerInfo entries that have the wi_dboid field set have a PID assigned.
* FIXME someday. The problem is that if we have workers failing to start for
* some reason, holding the start of new workers will worsen the starvation by
* disabling the start of a new worker as soon as one worker fails to start.
* So it's important to be able to distinguish a worker that has failed
* to start from a worker that is just taking its little bit of time to do so.
* There is another potential problem if, for some reason, a worker starts and
* is not able to finish correctly. It will not be able to set its finished
* flag, so the launcher will believe that it's still starting up. To prevent
* this problem, we should check the PGPROCs of worker processes, and clean
* them up if we find they are not actually running (or if they correspond to
* processes that are not autovacuum workers). FIXME someday.
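
To make the two fixes suggested in the comment a bit more concrete (hold off
on new workers while one is still starting up, and give up on a worker that
takes longer than autovacuum_naptime to report in), here is a minimal
standalone sketch. It is not code from the patch: the struct, the
wi_launchtime field, and both function names are made up for illustration;
only wi_dboid and wi_workerpid appear in the comment above, and the time
values are plain seconds rather than whatever the real code would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_WORKERS 3    /* hypothetical number of worker slots */

/*
 * Hypothetical stand-in for a WorkerInfo entry.  wi_launchtime is an
 * assumption added for this sketch; it is not mentioned in the comment.
 */
typedef struct SketchWorkerInfo
{
    unsigned int wi_dboid;      /* 0 means the slot is free */
    int          wi_workerpid;  /* 0 until the worker has reported in */
    long         wi_launchtime; /* seconds; when the start was requested */
} SketchWorkerInfo;

/*
 * Free every slot whose worker was requested more than "naptime" seconds
 * ago and still has no PID, i.e. treat it as having failed to start.
 * Returns the number of slots reclaimed.
 */
int
sketch_reap_stalled(SketchWorkerInfo *slots, size_t nslots,
                    long now, long naptime)
{
    int     reclaimed = 0;

    assert(slots != NULL);

    for (size_t i = 0; i < nslots; i++)
    {
        if (slots[i].wi_dboid != 0 &&
            slots[i].wi_workerpid == 0 &&
            now - slots[i].wi_launchtime > naptime)
        {
            slots[i].wi_dboid = 0;
            slots[i].wi_launchtime = 0;
            reclaimed++;
        }
    }
    return reclaimed;
}

/*
 * The launcher may ask for another worker only if some slot is free and
 * no slot with wi_dboid set is still waiting for its PID.
 */
bool
sketch_can_start_worker(const SketchWorkerInfo *slots, size_t nslots)
{
    bool    have_free = false;

    assert(slots != NULL);

    for (size_t i = 0; i < nslots; i++)
    {
        if (slots[i].wi_dboid == 0)
            have_free = true;
        else if (slots[i].wi_workerpid == 0)
            return false;       /* a worker is still starting up */
    }
    return have_free;
}
```

Calling the reaping function before the check is what keeps a single worker
that failed to start from blocking all later launches: once its slot has
been pending for longer than the nap time, it is freed and the launcher can
proceed.
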
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.