Here is a low-level, very detailed description of the implementation of
the autovacuum ideas we have so far.
launcher's dealing with databases
We'll add a new member "nexttime" to the autovac_dbase struct, which
will be the time_t of the next time a worker needs to process that DB.
Initially, those times will be 0 for all databases. The launcher will
keep that list in memory, and on each iteration it will fetch the entry
that has the earliest time, and sleep until that time. When it awakens,
it will start a worker on that database and set the nexttime to
The list will be a Dllist so that it's easy to keep it sorted by
increasing time: the launcher picks the head of the list each time, and
then puts that node back as the new tail.
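
As a rough illustration of the rotation described above, here is a small
Python model. It is only a sketch: the Database class, run_one_cycle and
the start_worker callback are names made up for the example, a deque
stands in for the Dllist, and rescheduling at now + naptime is an
assumption.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Database:
    oid: int
    nexttime: float = 0.0  # 0 means "never scheduled yet"


def run_one_cycle(dblist, now, naptime, start_worker):
    """Process the head of the list and move it to the tail.

    Returns the database that was handed to a worker.
    """
    db = dblist.popleft()        # head = entry with the earliest nexttime
    start_worker(db.oid)         # hypothetical callback that launches a worker
    db.nexttime = now + naptime  # assumption: reschedule one naptime ahead
    dblist.append(db)            # new tail; order holds while "now" only grows
    return db
```

Because every other entry was scheduled at some earlier instant plus
naptime, the rotated node always carries the largest nexttime, so no
re-sorting is needed.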
Every so often, the launcher will call autovac_get_database_list
and compare that list with the list it has in memory. If a new database
is in the list, it will assign a nexttime between the current instant
and the time of the head of the Dllist. Then it'll put it as the new
head. The new database will thus be put as the next database to be
processed.
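
The placement of a newly discovered database could look roughly like the
following. This is a sketch under assumptions: the schedule is modelled
as a deque of (nexttime, oid) tuples, add_new_database is a hypothetical
helper, and the midpoint is just one possible choice of a time "between
the current instant and the time of the head".

```python
from collections import deque


def add_new_database(schedule, new_oid, now):
    """Put a newly seen database at the head of the schedule.

    schedule is a deque of (nexttime, oid) tuples sorted by nexttime.
    Returns the nexttime that was assigned.
    """
    if schedule:
        head_time = schedule[0][0]
        # Midpoint between now and the head (an assumption). If the head
        # is already overdue, reuse its time so the list stays sorted.
        nexttime = min(head_time, now + max(head_time - now, 0.0) / 2.0)
    else:
        nexttime = now  # empty schedule: process right away
    schedule.appendleft((nexttime, new_oid))
    return nexttime
```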
When a node with nexttime=0 is found, the amount of time to sleep will
be determined as Max(naptime/num_elements, 1), so that initially
databases will be distributed roughly evenly in the naptime interval.
When a nexttime in the past is detected, the launcher will start a
worker either right away or as soon as possible (read below).
launcher and worker interactions
The launcher PID will be in shared memory, so that workers can signal
it. We will also keep worker information in shared memory as an array
of WorkerInfo structs:
We will use SIGUSR1 to communicate between workers and launcher. When
the launcher wants to start a worker, it sets the "dboid" field and
signals the postmaster. Then it goes back to sleep. When a worker has
started up and is about to start vacuuming, it will store its PID in
workerpid, and then send a SIGUSR1 to the launcher. If the schedule
says that there's no need to run a new worker, the launcher will go back
to sleep.
We cannot call SendPostmasterSignal a second time just after calling it;
the second call would be lost. So it is important that the launcher
does not try to start a worker while another one is still starting up.
If the launcher wakes up for any reason and detects a WorkerInfo entry
with a valid dboid but a zero workerpid, it will go back to sleep. Since
the starting worker will send a signal as soon as it finishes starting
up, the launcher will then wake up, see that no worker is still
starting, and be able to start a second worker.
Also, the launcher cannot start new workers when there are
autovacuum_max_workers already running. So if there are that many when
it wakes up, it cannot do anything else but go back to sleep again.
When one of those workers finishes, it will wake the launcher by setting
the finished flag on its WorkerInfo, and sending SIGUSR1 to the
launcher. The launcher then wakes up, resets the WorkerInfo struct, and
can start another worker if needed.
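
The checks the launcher makes each time it wakes up can be summarised in
a short Python model. It is a sketch, not the real struct: only the
dboid, workerpid and finished fields come from the description above,
while the WorkerInfo layout, INVALID_OID and launcher_can_start are
assumptions made for the example.

```python
from dataclasses import dataclass

INVALID_OID = 0  # stand-in for "no database assigned to this slot"


@dataclass
class WorkerInfo:
    dboid: int = INVALID_OID
    workerpid: int = 0
    finished: bool = False


def launcher_can_start(workers, max_workers):
    """Reset finished slots, then say whether a new worker may be started."""
    for w in workers:
        if w.finished:
            w.dboid, w.workerpid, w.finished = INVALID_OID, 0, False
    # A slot with a database but no PID means a worker is still starting;
    # signalling the postmaster again now would lose the second request.
    if any(w.dboid != INVALID_OID and w.workerpid == 0 for w in workers):
        return False
    running = sum(1 for w in workers if w.workerpid != 0)
    return running < max_workers
```

When this returns False the launcher simply goes back to sleep and waits
for the next SIGUSR1.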
There is an additional problem if, for some reason, a worker starts and
is not able to finish its task correctly. It will not be able to set
its finished flag, so the launcher will believe that it's still starting
up. To prevent this problem, we check the PGPROCs of worker processes,
and clean them up if we find they are not actually running (or the PIDs
correspond to processes that are not autovacuum workers). We only do
this when all WorkerInfo structures are in use, which is often enough
that the problem doesn't cause any starvation, but seldom enough that
it's not a performance hit.
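
A minimal sketch of that cleanup, assuming slots are modelled as dicts
and the PGPROC lookup is abstracted into a caller-supplied predicate
(both reclaim_dead_slots and is_live_autovacuum_worker are hypothetical
names):

```python
def reclaim_dead_slots(slots, is_live_autovacuum_worker):
    """Free slots whose PID no longer belongs to a live autovacuum worker.

    slots: list of dicts with 'dboid' and 'workerpid' keys (0 = unset).
    Returns the number of slots freed. The check only runs when every
    slot is in use, so it costs nothing in the common case.
    """
    if any(s["dboid"] == 0 for s in slots):
        return 0  # a free slot exists, no need to look for dead workers
    freed = 0
    for s in slots:
        # Slots still starting up (workerpid 0) are left alone here.
        if s["workerpid"] != 0 and not is_live_autovacuum_worker(s["workerpid"]):
            s["dboid"], s["workerpid"] = 0, 0
            freed += 1
    return freed
```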
worker to-do list
When each worker starts, it determines which tables to process in the
usual fashion: get pg_autovacuum and pgstat data and compute the
The worker then takes a "snapshot" of what's currently going on in the
database, by storing worker PIDs, the OID of the table each one is
currently working on, and the to-do list for each worker.
It removes from its to-do list the tables being processed. Finally, it
writes the list to disk.
The table list will be written to a file in
The file will consist of table OIDs, in the order in which they are
going to be vacuumed.
At this point, vacuuming can begin.
Before processing each table, it scans the WorkerInfos to see if there's
a new worker, in which case it reads that worker's to-do list into memory.
Then it again fetches the tables being processed by other workers in the
same database, and for each other worker, removes from its own in-memory
to-do list all those tables mentioned in the other list that appear
earlier than the table that worker is currently processing (inclusive).
Then it picks the next non-removed table in the list. All of this must
be done with the Autovacuum LWLock grabbed in exclusive mode, so that no
other worker can pick the same table (no I/O takes place here, because
the whole lists were saved in memory at the start).
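
The pruning step can be modelled in a few lines of Python. This is only
a sketch of the rule as stated: pick_next_table and its argument layout
are assumptions, and the locking is not modelled (the caller is assumed
to hold the lock for the whole call).

```python
def pick_next_table(my_todo, others):
    """Choose the next table for this worker.

    my_todo: table OIDs still pending for this worker, in vacuum order.
    others: one (todo_list, current_table) pair per other worker in the
            same database; current_table may be None.
    Returns (next_table_or_None, remaining_todo).
    """
    done_or_taken = set()
    for todo, current in others:
        if current in todo:
            # Everything up to and including the other worker's current
            # table is either already vacuumed or being vacuumed now.
            cut = todo.index(current) + 1
            done_or_taken.update(todo[:cut])
    remaining = [t for t in my_todo if t not in done_or_taken]
    if not remaining:
        return None, []
    return remaining[0], remaining[1:]
```

For example, if another worker's list is 20, 10, 50 and it is currently
on table 10, then tables 20 and 10 are dropped from this worker's list
while 50 is still left alone.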
other things to consider
This proposal doesn't deal with the hot tables stuff at all, but that is
very easy to bolt on later: just change the first phase, where the
initial to-do list is determined, to exclude "cold" tables. That way,
the vacuuming will be fast. Determining what is a cold table is still
an exercise for the reader ...
It may be interesting to avoid vacuuming at all when there's a
long-running transaction in progress. That way we avoid wasting I/O for
nothing, for example when there's a pg_dump running.
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support