On Mon, Jan 30, 2017 at 9:15 PM, Peter Geoghegan <p...@bowt.ie> wrote: >> IIUC worker_wait() is only being used to keep the worker around so its >> files aren't deleted. Once buffile cleanup is changed to be >> ref-counted (in an on_dsm_detach hook?) then workers might as well >> exit sooner, freeing up a worker slot... do I have that right? > > Yes. Or at least I think it's very likely that that will end up happening.
I've looked into this, and have a version of the patch where clean-up occurs when the last backend with a reference to the BufFile goes away. It seems robust; all of my private tests pass, which includes things that parallel CREATE INDEX won't use, and yet is added as infrastructure (e.g., randomAccess recycling of blocks by leader from workers). As Thomas anticipated, worker_wait() now only makes workers wait until the leader comes along to take a reference to their files, at which point the worker processes can go away. In effect, the worker processes go away as soon as possible, just as the leader begins its final on-the-fly merge. At that point, they could be reused by some other process, of course. However, there are some specific implementation issues with this that I didn't quite anticipate. I would like to get feedback on these issues now, from both Thomas and Robert. The issues relate to how much the patch can or should "buy into resource management". You might guess that this new resource management code is something that should live in fd.c, alongside the guts of temp file resource management, within the function FileClose(). That way, it would be called by every possible path that might delete a temp file, including ResourceOwnerReleaseInternal(). That's not what I've done, though. Instead, refcount management is limited to a few higher level routines in buffile.c. Initially, resource management in FileClose() is made to assume that it must delete the file. Then, if and when directed to by BufFileClose()/refcount, a backend may determine that it is not its job to do the deletion -- it will not be the one that must "turn out the lights", and so indicates to FileClose() that it should not delete the file after all (it should just release vFDs, close(), and so on). Otherwise, when refcount reaches zero, temp files are deleted by FileClose() in more or less the conventional manner. The fact that there could, in general, be any error that causes us to attempt a double-deletion (deletion of a temp file from more than one backend) for a time is less of a problem than you might think. This is because there is a risk of this only for as long as two backends hold open the file at the same time. In the case of parallel CREATE INDEX, this is now the shortest possible period of time, since workers close their files using BufFileClose() immediately after the leader wakes them up from a quiescent state. And, if that were to actually happen, say due to some random OOM error during that small window, the consequence is no worse than an annoying log message: "could not unlink file..." (this would come from the second backend that attempted an unlink()). You would not see this when a worker raised an error due to a duplicate violation, or any other routine problem, so it should really be almost impossible. That having been said, this probably *is* a problematic restriction in cases where a temp file's ownership is not immediately handed over without concurrent sharing. What happens to be a small window for the parallel CREATE INDEX patch probably wouldn't be a small window for parallel hash join. :-( It's not hard to see why I would like to do things this way. Just look at ResourceOwnerReleaseInternal(). Any release of a file happens during RESOURCE_RELEASE_AFTER_LOCKS, whereas the release of dynamic shared memory segments happens earlier, during RESOURCE_RELEASE_BEFORE_LOCKS. ISTM that the only sensible way to implement a refcount is using dynamic shared memory, and that seems hard. There are additional reasons why I suggest we go this way, such as the fact that all the relevant state belongs to BufFile, which is implemented a layer above all of the guts of resource management of temp files within fd.c. I'd have to replicate almost all state in fd.c to make it all work, which seems like a big modularity violation. Does anyone have any suggestions on how to tackle this? -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers