Hello,

PostgreSQL likes to probe the size of relations with lseek(SEEK_END) a
lot.  For example, a fully prewarmed pgbench -S transaction consists
of recvfrom(), lseek(SEEK_END), lseek(SEEK_END), sendto().  I think
lseek() is probably about as cheap as a syscall can be so I doubt it
really costs us much, but it's still a context switch and it stands
out when tracing syscalls, especially now that all the lseek(SEEK_SET)
calls are gone (commit c24dcd0cfd).

If we had a different kind of buffer mapping system of the kind that
Andres has described, there might be a place in shared memory that
could track the size of the relation.  Even if we could do that, I
wonder if it would still be better to do a kind of per-backend
lock-free caching, like this:

1.  Whenever a file size has been changed by extending or truncating
(ie immediately after the syscall), bump a shared "size_change"
invalidation counter.

2.  Somewhere in SMgrRelation, store the last known size_change
counter and the last known size.  In _mdnblocks(), if the counter
hasn't moved, we can use the cached size and skip the call to
FileSize().

3.  To minimise false cache invalidations (caused by other relations'
size changes), instead of using a single size_change counter in shared
memory, use an array of N and map relation OIDs onto them.

4.  As for memory coherency, I think it might be enough to use uint32
without locks or read barriers on the read size, since you have a view
of memory at least as new as your snapshot (the taking of which
included a memory barrier).  That's good enough because we don't need
to see blocks added after our snapshot was taken (the same assumption
applies today, this just takes further advantage of it), and
truncations can't happen while we have a share lock on the relation
(the taking of which also includes memory barrier, covering the case
where the truncation happened after our snapshot and the acquisition
of the share lock on the relation).  In other words, there is heavy
locking around truncation already, and for extension we don't care
about recent extensions so we can be quite relaxed about memory.
Right?

I don't have a patch for this (though I did once try it as a
throw-away hack and it seemed to work), but I just wanted to share the
idea and see if anyone sees a problem with the logic/interlocking, or
has a better idea for how to do this.  It occurred to me that I might
be missing something or this would have been done already...

-- 
Thomas Munro
http://www.enterprisedb.com

Reply via email to