Andres Freund writes:
> So, to get to the actual meat: My goal was to essentially get rid of an
> exclusive lock over relation extension altogether. I think I found a
> way to do that that addresses the concerns made in this thread.

> The new algorithm basically is:
> 1) Acquire victim buffer, clean it, and mark it as pinned
> 2) Get the current size of the relation, save buffer into blockno
> 3) Try to insert an entry into the buffer table for blockno
> 4) If the page is already in the buffer table, increment blockno by 1,
>    goto 3)
> 5) Try to read the page. In most cases it'll not yet exist. But the page
>    might concurrently have been written by another backend and removed
>    from shared buffers already. If already existing, goto 1)
> 6) Zero out the page on disk.

> I think this does handle the concurrency issues.

The need for (5) kind of destroys my faith in this really being safe: it
says there are non-obvious race conditions here.

For instance, what about this scenario:
* Session 1 tries to extend file, allocates buffer for page 42 (so it's
now between steps 4 and 5).
* Session 2 tries to extend file, sees buffer for 42 exists, allocates
buffer for page 43 (so it's also now between 4 and 5).
* Session 2 tries to read page 43, sees it's not there, and writes out
page 43 with zeroes (now it's done).
* Session 1 tries to read page 42, sees it's there and zero-filled
(not because anybody wrote it, but because holes in files read as 0).

At this point session 1 will go and create page 44, won't it, and you
just wasted a page.  Now we do have mechanisms for reclaiming such pages
but they may not kick in until VACUUM, so you could end up with a whole
lot of table bloat.
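
A minimal sketch of that scenario, replayed against a real file. This is a
toy in Python, not code from the patch: page_exists() is a hypothetical
stand-in for the step 5 check, assumed here to treat any full-length read
as "the page already exists", and BLCKSZ is set to PostgreSQL's default
block size of 8192. It relies on POSIX pread/pwrite semantics.

```python
import os
import tempfile

BLCKSZ = 8192  # PostgreSQL's default block size


def page_exists(fd, blockno):
    """Hypothetical step-5 check: a full-length read means the page exists."""
    return len(os.pread(fd, BLCKSZ, blockno * BLCKSZ)) == BLCKSZ


def replay():
    fd, path = tempfile.mkstemp()
    try:
        # The relation already has 42 pages (blocks 0..41).
        os.ftruncate(fd, 42 * BLCKSZ)
        # Session 1 has claimed block 42 in the buffer table, session 2
        # block 43. Session 2 runs first: block 43 is not on disk yet,
        # so it zero-fills it.
        assert not page_exists(fd, 43)
        os.pwrite(fd, bytes(BLCKSZ), 43 * BLCKSZ)
        # Session 1 now probes block 42. Nobody wrote it, but it lies
        # inside the file, so the read succeeds and returns zeroes.
        seen_42 = page_exists(fd, 42)
        page_42 = os.pread(fd, BLCKSZ, 42 * BLCKSZ)
        # Session 1 therefore starts over from the new end of file.
        next_block = os.fstat(fd).st_size // BLCKSZ
        return seen_42, page_42 == bytes(BLCKSZ), next_block
    finally:
        os.close(fd)
        os.unlink(path)


print(replay())  # (True, True, 44)
```

Session 1 sees block 42 as present and all zeroes, and its retry lands on
block 44, leaving 42 unused.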

Also, the file is likely to end up badly physically fragmented when the
skipped pages are finally filled in.  One of the good things about the
relation extension lock is that the kernel sees the file as being extended
strictly sequentially, which it should handle fairly well as far as
filesystem layout goes.  This way might end up creating a mess on-disk.

Perhaps even more to the point, you've added a read() kernel call that was
previously not there at all, without having removed either the lseek() or
the write().  Perhaps that scales better when what you're measuring is
saturation conditions on a many-core machine, but I have a very hard time
believing that it's not a significant net loss under less-contended
conditions.

I'm inclined to think that a better solution in the long run is to keep
the relation extension lock but find a way to extend files more than
one page per lock acquisition.
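
As a rough illustration of that idea, the sketch below keeps a single
extension lock but lets each acquisition reserve several blocks at once.
It is a toy model in Python, not PostgreSQL code: Relation, extend_by and
lock_acquisitions are hypothetical names, a threading.Lock stands in for
the relation extension lock, and pages_per_worker is assumed to be a
multiple of batch.

```python
import threading


class Relation:
    """Toy model of a relation file guarded by one extension lock."""

    def __init__(self):
        self.extension_lock = threading.Lock()
        self.nblocks = 0            # current length in blocks
        self.lock_acquisitions = 0  # how often the lock was taken

    def extend_by(self, npages):
        """Reserve npages new blocks under a single lock acquisition."""
        with self.extension_lock:
            self.lock_acquisitions += 1
            first = self.nblocks
            self.nblocks += npages  # the file still grows sequentially
            return range(first, first + npages)


def run(nworkers, pages_per_worker, batch):
    rel = Relation()
    results = [[] for _ in range(nworkers)]  # one private list per worker

    def worker(slot):
        for _ in range(pages_per_worker // batch):
            results[slot].extend(rel.extend_by(batch))

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Every block must be handed out exactly once, with no gaps.
    blocks = sorted(b for r in results for b in r)
    return rel.lock_acquisitions, blocks == list(range(rel.nblocks))


print(run(4, 64, 1))  # (256, True): one page per acquisition
print(run(4, 64, 8))  # (32, True): eight pages per acquisition
```

Both runs hand out the same 256 contiguous blocks, but the batched one
takes the lock an eighth as often.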

                        regards, tom lane
