What I see in this discussion is a huge amount of "the grass must be
greener on the other side" syndrome, and hardly any recognition that
every technique has its downsides and complications.

        Sure ;)

MVCC generates dead rows by its very nature; however, I see two distinct cases here:

        1* A large transaction that updates/deletes many rows.
For instance, suppose you UPDATE an entire table whose size is larger than memory.

Old row versions have to be kept somewhere at least until commit, be it in the table itself or in some auxiliary undo log. So there will be a lot of hard disk grinding anyway, be it MVCC, Oracle-style, or whatever. MVCC will bloat the table and indexes, then VACUUM will shrink them; update-in-place systems will bloat an undo log instead.

It seems to me the current MVCC+VACUUM is the right tool for this job, requiring about the same amount of IO as the others. VACUUM scans sequentially, which is the best way to process large volumes of data.
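
Just to put rough numbers on the "about the same amount of IO" claim, here is a back-of-the-envelope sketch in Python. Every figure in it (table size, WAL overhead, the cost of the VACUUM pass) is an assumption I made up for illustration, not a measurement:

# Illustrative only: rough I/O estimate for case 1, a full-table UPDATE.
# All figures (table size, WAL factor, VACUUM cost) are assumptions.

table_gb   = 20     # heap size, larger than RAM
wal_factor = 1.0    # assume roughly one byte of WAL per byte of new data

# MVCC (PostgreSQL-style): new row versions go into the heap, old versions
# stay behind as dead rows until VACUUM reclaims them sequentially.
mvcc_update_io = table_gb * (1 + wal_factor)   # write new versions + WAL
mvcc_vacuum_io = 2 * table_gb                  # sequential read + rewrite by VACUUM

# Update-in-place (Oracle-style): rows are overwritten, before-images go
# to an undo log that is written now and discarded later.
undo_update_io = table_gb * (1 + wal_factor)   # overwrite rows + redo
undo_log_io    = table_gb                      # write the before-images

print("MVCC + VACUUM  : %d GB" % (mvcc_update_io + mvcc_vacuum_io))
print("update-in-place: %d GB" % (undo_update_io + undo_log_io))

Same ballpark either way, and the extra VACUUM pass is purely sequential I/O, which is the cheapest kind.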

        2* Many short transactions update rows in a table
        Like the sessions problem, for instance.

        Current VACUUM sucks for this case; I guess that's known.

        -------

So, we have two different problems, and one tool which is adapted to only one of them. Should that tool (VACUUM) be fixed to handle both problems, making it more complex and harder to maintain, or should another tool be created specifically for the second problem? Problem 2 is very different from problem 1; the only case where they meet is when a continuous stream of small updates runs concurrently with a long transaction.
        So, what is the ideal tool for case 2?

We'd want a vacuuming machine that can be run very often, or even better, continuously. The table can be large, larger than the disk cache, so scanning it is not an option. The updates are probably randomly distributed across the table. Therefore, VACUUMing a long time after these transactions are committed, once the pages are no longer in cache, would require a lot of random seeks, which is also bad.
        Besides, we want this vacuum to be continuous and transparent.
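
A quick back-of-the-envelope calculation of why the deferred, seek-bound approach can't keep up (seek time and update rate are assumed figures, worst case of one dirtied page per update):

# Illustrative only: cost of vacuuming scattered dead rows once the pages
# have gone cold, versus cleaning them while they are still in memory.
seek_ms       = 8       # one random read of a cold 8 kB page
updates_per_s = 2000    # short transactions hitting random pages

dead_pages_per_hour = updates_per_s * 3600                   # worst case: one page each
seek_seconds = dead_pages_per_hour * 2 * seek_ms / 1000.0    # read + write back

print("seek time per hour of workload: %.0f hours" % (seek_seconds / 3600))
# -> about 32 hours of pure random I/O per hour of workload: hopeless.
# If the pages are cleaned while still dirty in memory, that I/O vanishes.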

The best time to vacuum pages is, thus, while they are still in the background writer's memory, or in the disk cache, waiting to be flushed to disk. There, they can be re-read, vacuumed, and re-written with no seek penalty, only additional WAL traffic. However, the amount of WAL traffic in bytes/s matters less than the frequency of WAL syncs. Emitting more WAL data shouldn't be a problem if those sync writes are coalesced with the sync writes of the transactions currently committing.
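
Rough numbers on the WAL side too, assuming the cleanup records simply ride along with commit syncs that happen anyway (the rates are made up):

# Illustrative only: extra WAL generated by on-the-fly page cleaning.
pages_cleaned_per_s = 2000   # matches the short-transaction rate above
extra_wal_per_page  = 100    # assumed bytes of WAL per cleaned page

extra_bytes_per_s  = pages_cleaned_per_s * extra_wal_per_page
extra_fsyncs_per_s = 0       # cleanup records piggyback on existing commit syncs

print("extra WAL: %.0f kB/s, extra fsyncs: %d/s"
      % (extra_bytes_per_s / 1024.0, extra_fsyncs_per_s))
# ~200 kB/s of extra sequential WAL is noise, and zero extra syncs means
# commit latency is untouched.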

So, I guess the best solution for case 2 is to have the background writer perform on-the-fly VACUUM:

An UPDATE or DELETE transaction hands its dirty pages over to the bgwriter for writing. It also tells the bgwriter the ID of the current transaction, and flags specifying whether the pages contain candidate dead rows. The bgwriter should have a large enough buffer to keep these pages in memory until all the transactions that can still see the dead rows in them have finished.
        Then, the pages are vacuumed and written.

The key is the size of the buffer. It should be large enough to hold enough pages that it is actually possible to vacuum something out of them before writing them. However, if the buffer capacity is exceeded (for instance, because of a long-running transaction), this is not a problem: the pages are simply written to disk as usual; they will contain dead rows, which will need to be handled later by the standard VACUUM.
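
To make the hand-off concrete, here is a toy model of the whole thing in Python. The class names, the FIFO queue and the one-line visibility test are simplifications of mine for the sake of the sketch, not the actual bgwriter code or PostgreSQL's real visibility rules:

# Toy model of on-the-fly vacuum in the background writer.
# Everything here (names, structures, the visibility test) is invented
# for illustration; it is not the real bgwriter code or API.
from collections import deque

class DirtyPage:
    def __init__(self, page_no, xid, may_have_dead_rows):
        self.page_no = page_no
        self.xid = xid                        # transaction that dirtied the page
        self.may_have_dead_rows = may_have_dead_rows

class BgWriter:
    def __init__(self, buffer_capacity):
        self.capacity = buffer_capacity
        self.queue = deque()                  # dirty pages waiting to be written

    def hand_over(self, page, oldest_active_xid):
        """Called by an UPDATE/DELETE backend when it dirties a page."""
        self.queue.append(page)
        self.flush(oldest_active_xid)

    def flush(self, oldest_active_xid):
        """Write pages out, vacuuming them first whenever it is already safe."""
        while self.queue:
            page = self.queue[0]
            still_needed = page.xid >= oldest_active_xid
            if page.may_have_dead_rows and still_needed:
                if len(self.queue) <= self.capacity:
                    break   # wait: some transaction may still see the old rows
                # Buffer overflow (e.g. a long-running transaction): write the
                # page with its dead rows and let plain VACUUM deal with it later.
                self.write(self.queue.popleft(), vacuumed=False)
            else:
                # No live transaction can see the dead rows any more: prune
                # them now, while the page is still in memory, then write it.
                self.write(self.queue.popleft(), vacuumed=True)

    def write(self, page, vacuumed):
        print("page %d written, vacuumed on the fly: %s" % (page.page_no, vacuumed))

# Tiny demo: xid 10 is still running, so a page dirtied by xid 12 has to wait,
# while a page dirtied by xid 5 can be cleaned immediately.
bg = BgWriter(buffer_capacity=4)
bg.hand_over(DirtyPage(1, xid=5, may_have_dead_rows=True), oldest_active_xid=10)
bg.hand_over(DirtyPage(2, xid=12, may_have_dead_rows=True), oldest_active_xid=10)

# Keep dirtying pages while xid 10 runs: once the buffer is full, the oldest
# pages are written unvacuumed instead of blocking the writer.
for n in range(3, 9):
    bg.hand_over(DirtyPage(n, xid=12, may_have_dead_rows=True), oldest_active_xid=10)

The real thing would of course track visibility per tuple and talk to the existing shared buffer machinery, but the control flow is the whole idea: clean the page if you can, write it dirty if you must.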

I think this would maximize code reuse by using the current bgwriter's architecture... did I miss something?
