"Heikki Linnakangas" <[EMAIL PROTECTED]> writes:

> There is one wacky idea I haven't dared to propose yet:
> We could lift the limitation that you can't defragment a page that's
> pinned, if we play some smoke and mirrors in the buffer manager. When
> you prune a page, make a *copy* of the page you're pruning, and keep
> both versions in the buffer cache. Old pointers keep pointing to the old
> version. Any new calls to ReadBuffer will return the new copy, and the
> old copy can be dropped when its pin count drops to zero.

Fwiw when Heikki first mentioned this idea I thought it was the craziest thing
I ever heard. But the more I thought about it the more I liked it. I've come
to the conclusion that while it's a wart, it's not much worse than the wart of
the super-exclusive lock which it replaces. In fact it's arguably cleaner in
some ways.

As a result vacuum would never have to wait for arbitrarily long pins and
there wouldn't be the concept of a vacuum waiting for a vacuum lock with
strange lock queueing semantics. It also means we could move tuples around on
the page more freely.

The only places which would have to deal with a possible new buffer would be
precisely those places that lock the page. If you aren't locking the page then
you definitely aren't about to fiddle with any bits that matter since your
writes could be lost. Certainly you're not about to set xmin or xmax or
anything like that. 

You might set hint bits which would be lost but probably not often since you
would have already checked the visibility of the tuples with the page locked.
There may be one or two places where we fiddle bits for a tuple we've just
inserted ourselves thinking nobody else can see it yet, but the current
philosophy seems to be leaning towards treating such coding practices as
unacceptably fragile anyways.

The buffer manager doesn't really need to track multiple "versions" of pages.
It would just mark the old version as an orphaned buffer which is
automatically a victim for the clock sweep as soon as the pin count drops to
0. It will never need to return such a buffer. What we would need is enough
information to reread the buffer if someone tries to lock it and to unpin it
when someone unpins a newer version.

At first I thought the cost of copying the page would be a downside but in
fact Heikki pointed out that in defragmentation we're already copying the
page. In fact copying it to new memory instead of memory which is almost
certainly likely in processor caches which would need to be invalidated would
actually be faster and avoiding the use of memmove could be faster too.

  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Reply via email to