On 27/02/2023 23:45, Andres Freund wrote:
On 2023-02-27 18:06:22 +0200, Heikki Linnakangas wrote:
We also do this in freespace.c and visibilitymap.c:
/* Extend as needed. */
while (fsm_nblocks_now < fsm_nblocks)
{
PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
pg.data, false);
fsm_nblocks_now++;
}
We could use the new smgrzeroextend function here. But it would be better to
go through the buffer cache, because after this, the last block, at
'fsm_nblocks', will be read with ReadBuffer() and modified.
I doubt it's a particularly crucial thing to optimize.
Yeah, it won't make any practical difference to performance. I'm more
thinking if we can make this more consistent with other places where we
extend a relation.
But, uh, isn't this code racy? Because this doesn't go through shared buffers,
there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
that writing pages isn't atomic vs readers. So another connection could
connection could see the new relation size, but a read might return a
partially written state of the page. Which then would cause checksum
failures. And even worse, I think it could lead to loosing a write, if the
concurrent connection writes out a page.
fsm_readbuf and vm_readbuf check the relation size first, with
smgrnblocks(), before trying to read the page. So to have a problem, the
smgrnblocks() would have to already return the new size, but the
smgrread() would not return the new contents. I don't think that's
possible, but not sure.
We could use BulkExtendSharedRelationBuffered() to extend the relation and
keep the last page locked, but the BulkExtendSharedRelationBuffered()
signature doesn't allow that. It can return the first N pages locked, but
there's no way to return the *last* page locked.
We can't rely on bulk extending a, potentially, large number of pages in one
go anyway (since we might not be allowed to pin that many pages). So I don't
think requiring locking the last page is a really viable API.
I think for this case I'd just just use the ExtendRelationTo() API we were
discussing nearby. Compared to the cost of reducing syscalls / filesystem
overhead to extend the relation, the cost of the buffer mapping lookup does't
seem significant. That's different in e.g. the hio.c case, because there we
need a buffer with free space, and concurrent activity could otherwise fill up
the buffer before we can lock it again.
Works for me.
- Heikki