Recap
=====

A design goal of parallel tuplesort is that the parallel case be as close as possible to the serial case. Very little new code is needed in tuplesort.c and logtape.c, most of which is about using a new lower-level facility that is very well encapsulated. Even buffile.c "unification", the guts of everything, doesn't have many new special cases.
So, workers start with "unifiable" BufFiles, which differ very little from conventional temp BufFiles -- they just create filenames in a deterministic fashion, making them discoverable to the leader process through a process of unification (plus there is refcount state in shared memory, though very little). In principle, workers could decide not to go through with unification long after the fact, and close their unifiable temp files without doing anything for the leader, and that would be just fine. By the same token, the unified BufFile owned by the leader can be extended if need be, in a way that is virtually indistinguishable from extending the end of any other BufFile. logtape.c recycling for future randomAccess cases works just the same as before.

I think that this is a nice state of affairs, since having no special cases will tend to make the code robust. Applying leader-wise offsets within logtape.c is done at one precise choke point (one function), and so it seems very likely that logtape.c routines that CREATE INDEX happens not to use will "just work" when parallel tuplesort has more clients at some point in the future. A logical tapeset doesn't even specifically remember that it is using unification (unless you count that it has a non-zero offset to apply).

V8 of the parallel tuplesort patch added a refcount mechanism that is associated with a "unified" BufFile, consisting of $nworker unifiable BufFiles, with metadata in shared memory per worker/unifiable BufFile. We have a refcount per unifiable BufFile, not per fd.c segment. The refcount mechanism lets worker processes go away as soon as the leader begins its final on-the-fly merge -- the leader assumes ownership of the BufFiles/segments initially owned only by workers. There is a very brief period of co-ownership, which occurs during unification in the leader.
Concern
=======

My pending V9 of the patch has to solve a problem that V8 had: it's not very nice that an error in the leader (e.g. a palloc() failure) during this brief period of co-ownership will make the resource manager attempt a double unlink() (one from the worker, another from the leader). There is no danger of data loss, but it's ugly. The general consensus is that the way to handle it is with an on_dsm_detach() callback on the top-level parallel context shared memory segment. I have already written some rough code that does all this, and demonstrably removes the double unlink() hazard, but I have some outstanding concerns that I'd like to clear up.

We have at least 3 different "resource contexts" in play in this rough code:

1. The DSM segment.

2. fd.c segment clean-up (i.e., cleanup that falls to the resourceOwner stored within buffile.c for BufFile segments).

3. The memory context in which BufFile stuff is allocated.

I am bridging the gap between local memory and shared memory here, in essence. On the one hand, local state like virtual FDs is what we need to mark as non-temp, so that the second unlink() is never attempted on fd.c resource cleanup, which comes last of all (per the resource manager). We do need to touch refcounts in shared memory too, but shared memory holds only some of the relevant state, which seems like something to avoid changing, because we want to keep all of those nice advantages I went into above. On the other hand, it seems wrong to jump from shared memory to local memory in my on_dsm_detach() callback, though that's what works for me right now.

The callback is only registered in the leader process, and only really needs to be around for as long as there is a brief period of co-ownership of BufFiles. Say the caller never gets around to a BufFileClose() call. That's a bug anyway, and will produce a temp file reference leak warning today. But this change turns that bug into a hard crash, which is a more serious bug.
This happens because the on_dsm_detach() callback ends up calling BufFileClose() against a pointer to garbage (freed local memory). Had BufFileClose() been called before resource cleanup invoked the callback, which should happen anyway, then it would have set the pointer to local memory (stored in *shared* memory) to NULL. No hard crash would happen from the callback, which returns early, never spuriously calling BufFileClose() a second time within the leader.

ISTM that even today, a buffile.c caller's memory context tacitly shares the lifetime of the caller's resourceOwner. Otherwise, everything would break. It doesn't seem that much of a stretch to assume that the DSM segment callback implicitly has the opportunity to "go first", since ISTM that things are ordered within AbortTransaction() and other similar places such that that always happens. Testing shows that my rough code does the right thing when an error is thrown from a variety of places. But:

* I'm still stashing a pointer to local memory in shared memory, purely out of convenience, which is just weird.

* It seems wrong that a piece of infrastructure pretty far removed from the core transaction manager code relies on ordering like that. It's not as if the BufFile contract requires that the segment originate from the parallel infrastructure at all, though that's what I do right this minute.

* It seems wrong that the memory had better not have been freed already, through a normal pfree() or similar, without the caller actually going through BufFileClose(), which knows to unset the pointer within shared memory (or, alternatively, without the callback occurring before local memory is freed, in the case where the callback "really needs to fire").

I certainly don't grok resource managers like Robert does, so a much better approach might be right in front of me. The "obvious solution" is to have a lot more state in shared memory, but per-segment state needs to grow unpredictably.
Plus, that would be a big, risky change to make for only a small benefit, at least as far as parallel tuplesort goes. Thomas does require something significantly more sophisticated for his parallel hash join patch, but it looks increasingly likely that that will end up being specialized to the problem he is trying to solve. I'm also slightly tempted to hard-code BufFiles as a new type of resource that a resource manager can take ownership of, but that also seems unappealing.

What I've come up with may be as robust as anything will be for parallel CREATE INDEX alone, but I want to have confidence that any assumptions made are clear, and are correct.

Thanks
--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers