What I see in this discussion is a huge amount of "the grass must be
greener on the other side" syndrome, and hardly any recognition that
every technique has its downsides and complications.
MVCC generates dead rows by its very nature; however, I see two trends
in this:
1* A large transaction that updates/deletes many rows.
For instance, suppose you UPDATE an entire table whose size is larger
than the disk cache.
Old row versions have to be kept somewhere until commit, be it in the
table itself or in some auxiliary undo log.
So there will be a lot of hard-disk grinding either way, be it MVCC,
Oracle-style, or whatever. MVCC will bloat the table and indexes, then
VACUUM will shrink them; update-in-place systems will bloat an undo log.
It seems to me the current MVCC+VACUUM is the right tool for this job,
requiring about the same amount of IO as the others.
Vacuum scans sequentially, so it's the best way to process large volumes
of data.
2* Many short transactions update rows in a table
Like the sessions problem, for instance.
Current VACUUM sucks for this case; I guess that's known.
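To put a rough number on case 2: every UPDATE leaves a dead version of the row behind until VACUUM reclaims it, so bloat grows with the update rate, not with the live data size. A back-of-envelope sketch (plain Python; the workload figures are invented for illustration):

```python
# Under MVCC, each UPDATE writes a new row version and leaves the old
# one behind as a dead tuple until VACUUM reclaims it.

def table_bloat(live_rows: int, updates_per_row: int) -> tuple[int, float]:
    """Dead tuples accumulated between two VACUUM runs, and the
    resulting size ratio versus a freshly vacuumed table."""
    dead = live_rows * updates_per_row
    bloat_factor = (live_rows + dead) / live_rows
    return dead, bloat_factor

# A 10,000-row sessions table where each session row is updated 100
# times between two VACUUM runs:
dead, factor = table_bloat(10_000, 100)
print(dead, factor)  # 1000000 dead tuples, 101x the minimal table size
```

The table carries a hundred times its useful size in dead tuples, which is why this workload wants vacuuming to happen very frequently rather than in occasional full-table passes.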
So, we have two different problems, and one tool which is adapted to one
problem only. Should the tool (Vacuum) be fixed to handle both problems,
making it more complex and difficult to maintain, or should another tool
be created specifically for the second problem?
Problem 2 is very different from problem 1. The only case when they meet
is when there is a continuous stream of small updates running concurrently
with a long transaction.
So, what is the ideal tool for case 2?
We'd want a vacuuming machine that can be run very often, or even
better, continuously.
The table can be large, larger than the disk cache, so scanning it is
not cheap.
The updates are probably randomly distributed throughout the table.
Therefore, VACUUMing a long time after these transactions are committed,
when the pages are no longer in cache, would require a lot of random
seeks, which is also expensive.
Besides, we want this vacuum to be continuous and transparent.
The best time to vacuum pages is, thus, while they are still in the
background writer's memory, or in the disk cache, waiting to be flushed
to disk. There, they can be re-read, vacuumed and re-written with no
seek penalty, only additional WAL traffic. However, the amount of WAL
traffic in bytes/s matters less than the frequency of WAL syncs;
emitting more WAL data shouldn't be a problem if those sync writes are
coalesced with the sync writes of current transactions.
So, I guess the best solution for case 2 is to have the background
writer perform on-the-fly VACUUM:
An UPDATE or DELETE transaction hands over its dirty pages to the
bgwriter for writing. It also tells the bgwriter the ID of the current
transaction, along with flags specifying whether the pages contain
candidate dead rows.
The bgwriter should have a sufficiently large buffer in order to be able
to keep these pages in memory until all the transactions that can see the
dead rows in these pages are finished.
Then, the pages are vacuumed and written.
The key is the size of the buffer. It should be large enough to contain
enough pages so that it is actually possible to vacuum something out of
them before writing them. However, if the buffer capacity is exceeded
(for instance, because there is a long-running transaction), this is not
a problem: the pages are simply written to disk normally; they will
contain dead rows, which will need to be handled later by the standard
VACUUM.
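The queueing decision described above can be sketched in a few lines. This is plain Python rather than backend C, and the names (DirtyPage, BgWriter, oldest_running_xid) are invented for illustration; it models only when a page gets vacuumed versus written as-is, not the real page format or WAL:

```python
from collections import deque

class DirtyPage:
    def __init__(self, page_id, writer_xid, has_dead_rows):
        self.page_id = page_id
        self.writer_xid = writer_xid      # transaction that dirtied the page
        self.has_dead_rows = has_dead_rows

class BgWriter:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()   # FIFO: pages arrive in roughly xid order
        self.written = []       # (page_id, vacuumed) pairs; stands in for disk

    def hand_over(self, page, oldest_running_xid):
        """A transaction hands a dirty page to the bgwriter."""
        self.buffer.append(page)
        if len(self.buffer) > self.capacity:
            # Capacity exceeded (e.g. a long-running transaction keeps
            # the visibility horizon back): write the oldest page as-is;
            # its dead rows are left for the standard VACUUM.
            old = self.buffer.popleft()
            self.written.append((old.page_id, False))
        self._flush_reclaimable(oldest_running_xid)

    def _flush_reclaimable(self, oldest_running_xid):
        # A page may be vacuumed once no running transaction can still
        # see its dead rows, i.e. the xid that produced them is older
        # than every transaction still running.
        while self.buffer and self.buffer[0].writer_xid < oldest_running_xid:
            page = self.buffer.popleft()
            self.written.append((page.page_id, page.has_dead_rows))
```

With a large enough buffer, pages handed over by short transactions get vacuumed while still in memory and hit disk clean; when a long transaction holds the horizon back, the buffer overflows gracefully and those pages simply fall back to the normal VACUUM path.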
I think this would maximize code reuse by using the current bgwriter's
architecture... did I miss something?