On 2015-07-19 11:28:25 -0400, Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > So, to get to the actual meat: My goal was to essentially get rid of an
> > exclusive lock over relation extension altogether. I think I found a
> > way to do that that addresses the concerns made in this thread.
> >
> > The new algorithm basically is:
> > 1) Acquire victim buffer, clean it, and mark it as pinned
> > 2) Get the current size of the relation, save buffer into blockno
> > 3) Try to insert an entry into the buffer table for blockno
> > 4) If the page is already in the buffer table, increment blockno by 1,
> >    goto 3)
> > 5) Try to read the page. In most cases it'll not yet exist. But the page
> >    might concurrently have been written by another backend and removed
> >    from shared buffers already. If already existing, goto 1)
> > 6) Zero out the page on disk.
> >
> > I think this does handle the concurrency issues.
>
> The need for (5) kind of destroys my faith in this really being safe: it
> says there are non-obvious race conditions here.
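
To make the six quoted steps concrete, here is a toy, single-process C
model of the control flow. It is only a sketch and not PostgreSQL code:
on_disk, in_buftable, rel_nblocks, buftable_insert(), extend_relation()
and MAX_BLOCKS are all hypothetical names invented for illustration.
Step 1 (victim buffer handling) is not modelled, the size_seen argument
stands in for a possibly stale result of step 2, and dropping the
buffer-table entry before retrying in step 5 is an assumption, since the
steps above don't say what happens to it.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLOCKS 64                  /* hypothetical toy limit */

static bool on_disk[MAX_BLOCKS];       /* page has been written to the file */
static bool in_buftable[MAX_BLOCKS];   /* block has a buffer-table entry */
static int  rel_nblocks;               /* what a fresh size lookup would return */

/* Step 3: try to claim blockno in the buffer table; false if already taken. */
bool buftable_insert(int blockno)
{
    if (in_buftable[blockno])
        return false;
    in_buftable[blockno] = true;
    return true;
}

/*
 * Extend the toy relation by one page and return the new block number.
 * size_seen models the relation size observed in step 2, which may be
 * stale by the time step 5 runs.
 */
int extend_relation(int size_seen)
{
    for (;;)
    {
        /* 1) acquire and pin a victim buffer (not modelled) */

        /* 2) start from the relation size we observed */
        int blockno = size_seen;

        /* 3) + 4) walk forward until we own a buffer-table entry */
        assert(blockno < MAX_BLOCKS);
        while (!buftable_insert(blockno))
        {
            blockno++;
            assert(blockno < MAX_BLOCKS);
        }

        /* 5) page was already written and evicted by someone else: retry */
        if (on_disk[blockno])
        {
            in_buftable[blockno] = false;   /* assumption: give the entry back */
            size_seen = rel_nblocks;        /* re-read the size, as in goto 1) */
            continue;
        }

        /* 6) zero out the page on disk, which extends the file */
        on_disk[blockno] = true;
        if (blockno + 1 > rel_nblocks)
            rel_nblocks = blockno + 1;
        return blockno;
    }
}
```

Calling extend_relation(rel_nblocks) repeatedly hands out consecutive
blocks. Pre-setting in_buftable[n] before a call mimics another session
having reserved block n, and passing an outdated size_seen for a block
that is on disk but no longer in the buffer table exercises the step 5
retry.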

It's not simple, I agree. I'm doubtful that a significantly simpler
approach exists.

> For instance, what about this scenario:
> * Session 1 tries to extend file, allocates buffer for page 42 (so it's
>   now between steps 4 and 5).
> * Session 2 tries to extend file, sees buffer for 42 exists, allocates
>   buffer for page 43 (so it's also now between 4 and 5).
> * Session 2 tries to read page 43, sees it's not there, and writes out
>   page 43 with zeroes (now it's done).
> * Session 1 tries to read page 42, sees it's there and zero-filled
>   (not because anybody wrote it, but because holes in files read as 0).
>
> At this point session 1 will go and create page 44, won't it, and you
> just wasted a page.

My local code now recognizes that case and uses the page. We just need
to do a PageIsNew() check.

> Also, the file is likely to end up badly physically fragmented when the
> skipped pages are finally filled in. One of the good things about the
> relation extension lock is that the kernel sees the file as being extended
> strictly sequentially, which it should handle fairly well as far as
> filesystem layout goes. This way might end up creating a mess on-disk.

I don't think that'll actually happen with any recent filesystems. Pretty
much all of them do delayed allocation. But it definitely is a concern
with older filesystems.

I've just measured and with ext4 the number of extents per segment in a
300GB relation doesn't show a significant difference when compared between
the existing and new code.

We could try to address this by optionally using posix_fallocate() to do
the actual extension - then there'll not be sparse regions, but actually
allocated disk blocks.

> Perhaps even more to the point, you've added a read() kernel call that was
> previously not there at all, without having removed either the lseek() or
> the write().
> Perhaps that scales better when what you're measuring is
> saturation conditions on a many-core machine, but I have a very hard time
> believing that it's not a significant net loss under less-contended
> conditions.

Yes, this has me worried too.

> I'm inclined to think that a better solution in the long run is to keep
> the relation extension lock but find a way to extend files more than
> one page per lock acquisition.

I doubt that'll help as much. As long as you have to search and write out
buffers under an exclusive lock that'll be painful. You might be able to
make that an infrequent occurrence by extending in larger amounts and
entering the returned pages into the FSM, but you'll have rather
noticeable latency increases every time that happens. And not just in the
extending relation - all the other relations will wait for the one doing
the extending. We could move that into some background process, but at
that point things have gotten seriously complex.

The more radical solution would be to have some place in memory that'd
store the current number of blocks. Then all the extension-specific
locking we'd need would be around incrementing that. But how and where to
store that isn't easy.

Greetings,

Andres Freund

-- 
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers