On Thu, 15 May 2008, Heikki Linnakangas wrote:
> Is it really safe to update the hint bits in place? If there is a
> power cut in the middle of writing a block, is there a guarantee from
> the disc that the block will never be garbled?

Don't know, to be honest. We've never seen any reports of corrupted data that would suggest such a problem, but it doesn't seem impossible to me that some exotic storage system might do that.

Hmm. That problem is what WAL full-page writes are meant to handle, isn't it? So basically, if you're telling people that WAL full-page writes are safer than partial WAL because they avoid updating pages in place, then you shouldn't be updating pages in place for the hint bits either. You can't win!
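
To make the worry concrete, here is a rough sketch in Python of a block write interrupted part-way through. It is a toy model, not PostgreSQL code: the 512-byte sector size, the helper names torn_write and recover, and the idea that a block lands on the disc sector by sector are all assumptions made for illustration. The 8192-byte block is the usual PostgreSQL default.

```python
from typing import Optional

SECTOR = 512   # assumed unit the disc writes atomically (illustrative)
PAGE = 8192    # default PostgreSQL block size

def torn_write(old: bytes, new: bytes, sectors_written: int) -> bytes:
    """Return what is on the disc if power fails after `sectors_written` sectors."""
    cut = sectors_written * SECTOR
    return new[:cut] + old[cut:]

def recover(on_disc: bytes, full_page_image: Optional[bytes]) -> bytes:
    """If the WAL holds a full image of the page, recovery overwrites the
    block wholesale; otherwise whatever is on the disc is all we have."""
    return full_page_image if full_page_image is not None else on_disc

old_page = bytes([0xAA]) * PAGE
new_page = bytes([0xBB]) * PAGE

# Power cut after 5 of the 16 sectors: the block is neither old nor new.
torn = torn_write(old_page, new_page, sectors_written=5)
with_image = recover(torn, new_page)
without_image = recover(torn, None)
```

With a full image logged, the mixed block is simply replaced. A change made in place with nothing logged has no such fallback, which is the point of the question above.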

In fact, if the tuple's creating transaction has aborted, then the tuple can be vacuumed right there and then before it is even written.

Not if you have any indexes on the table. To vacuum, you'll have to scan all indexes to remove pointers to the tuple.

Ah. Well, would that be so expensive? After all, someone has to do it eventually, and these are index entries that have only just been added anyway.

I can understand index updating being a bit messy in the middle of a checkpoint though, as you would have to write the update to the WAL, which you are checkpointing...

So, I don't know exactly how the WAL updates to indexes work, but my guess is that it has been implemented as "write the blocks that we would change to the WAL". The problem with this is that all the changes to the index are done individually, so there's no easy way to "undo" one of them later on when you find out that the transaction has been aborted during the checkpoint.

An alternative would be to build a "list of changes" in the WAL without actually changing the underlying index at all. When reading the index, you would read the "list" first (which would be in memory, and in an efficient-to-search structure), then read the original index and add the two. Then when checkpointing, vet all the changes against known aborted transactions before making all the changes to the index together. This is likely to speed up index writes quite a bit, and also allow you to effectively vacuum aborted tuples before they get written to the disc.
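
Roughly what I have in mind, as a Python sketch. This is only an illustration of the idea, not how PostgreSQL's indexes or WAL actually work: the class OverlayIndex, its method names, and the use of plain dictionaries for both the "on-disc" index and the in-memory list are all made up for the example.

```python
from collections import defaultdict

class OverlayIndex:
    """Toy index: a base structure plus an in-memory list of pending changes."""

    def __init__(self):
        self.base = defaultdict(set)      # key -> tuple ids (stands in for the on-disc index)
        self.pending = defaultdict(list)  # key -> [(xid, tuple id)], not yet applied

    def insert(self, xid, key, tid):
        # Record the change in the pending list only; the base index is untouched.
        self.pending[key].append((xid, tid))

    def lookup(self, key):
        # Read the pending list and the base index, and return the union.
        # As with an ordinary index, visibility still has to be checked
        # against the tuple itself.
        result = set(self.base.get(key, ()))
        result.update(tid for _, tid in self.pending.get(key, ()))
        return result

    def checkpoint(self, aborted_xids):
        # Vet every pending change against the known-aborted transactions,
        # then apply the survivors to the base index in one pass.
        dropped = 0
        for key, entries in self.pending.items():
            for xid, tid in entries:
                if xid in aborted_xids:
                    dropped += 1          # never reaches the base index
                else:
                    self.base[key].add(tid)
        self.pending.clear()
        return dropped

idx = OverlayIndex()
idx.insert(100, "apple", 1)
idx.insert(101, "apple", 2)   # transaction 101 will abort
before = idx.lookup("apple")
dropped = idx.checkpoint(aborted_xids={101})
after = idx.lookup("apple")
```

Here the entry from the aborted transaction 101 is thrown away at checkpoint time and never touches the base index, so there is nothing to vacuum out of it later.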

Matthew

--
Vacuums are nothings. We only mention them to let them know we know
they're there.

--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)