On Fri, Feb 10, 2017 at 11:31 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Thu, Feb 9, 2017 at 5:09 PM, Thomas Munro > <thomas.mu...@enterprisedb.com> wrote: >> I agree that the pinned segment case doesn't matter right now, I just >> wanted to point it out. I like your $10 wrench analogy, but maybe it >> could be argued that adding a dsm_on_destroy() callback mechanism is >> not only better than adding another refcount to track that other >> refcount, but also a steal at only $8. > > If it's that simple, it might be worth doing, but I bet it's not. One > problem is that there's a race condition: there will inevitably be a > period of time after you've called dsm_attach() and before you've > attached to the specific data structure that we're talking about here. > So suppose the last guy who actually knows about this data structure > dies horribly and doesn't clean up because the DSM isn't being > destroyed; moments later, you die horribly before reaching the code > where you attach to this data structure. Oops.
Right, I mentioned this problem earlier ("and also register the on_dsm_detach callback, before any chance that an error might be thrown (is that difficult?); failure to do so could result in file leaks"). Here's my thought process... please tell me where I'm going wrong: I have been assuming that it's not enough to just deal with this when the leader detaches on the theory that other participants will always detach first: that probably isn't true in some error cases, and could contribute to spurious racy errors where other workers complain about disappearing files if the leader somehow shuts down and cleans up while a worker is still running. Therefore we need *some* kind of refcounting, whether it's a new kind or a new mechanism based on the existing kind. I have also been assuming that we don't want to teach dsm.c directly about this stuff; it shouldn't need to know about other modules, so we don't want it talking to buffile.c directly and managing a special table of files; instead we want a system of callbacks. Therefore client code needs to do something after attaching to the segment in each backend. It doesn't matter whether we use an on_dsm_detach() callback and manage our own refcount to infer that destruction is imminent, or a new on_dsm_destroy() callback which tells us so explicitly: both ways we'll need to make sure that anyone who attaches to the segment also "attaches" to this shared BufFile manager object inside it, because any backend might turn out to be the one that is last to detach. That bring us to the race you mentioned. Isn't it sufficient to say that you aren't allowed to do anything that might throw in between attaching to the segment and attaching to the SharedBufFileManager that it contains? Up until two minutes ago I assumed that policy would leave only two possibilities: you attach to the DSM segment and attach to the SharedBufFileManager successfully or you attach to the DSM segment and then die horribly (but not throw) and the postmaster restarts the whole cluster and blows all temp files away with RemovePgTempFiles(). But I see now in the comment of that function that crash-induced restarts don't call that because "someone might want to examine the temp files for debugging purposes". Given that policy for regular private BufFiles, I don't see why that shouldn't apply equally to shared files: after a crash restart, you may have some junk files that won't be cleaned up until your next clean restart, whether they were private or shared BufFiles. > You might think about plugging that hole by moving the registry of > on-destroy functions into the segment itself and making it a shared > resource. But ASLR breaks that, especially for loadable modules. You > could try to fix that problem, in turn, by storing arguments that can > later be passed to load_external_function() instead of a function > pointer per se. But that sounds pretty fragile because some other > backend might not try to load the module until after it's attached the > DSM segment and it might then fail because loading the module runs > _PG_init() which can throw errors. Maybe you can think of a way to > plug that hole too but you're waaaaay over your $8 budget by this > point. Agreed, those approaches seem like a non-starters. >> My problem here is that I don't know how many batches I'll finish up >> creating. [...] > > I thought the idea was that the structure we're talking about here > owns all the files, up to 2 from a leader that wandered off plus up to > 2 for each worker. Last process standing removes them. Or are you > saying each worker only needs 2 files but the leader needs a > potentially unbounded number? Yes, potentially unbounded in rare case. If we plan for N batches, and then run out of work_mem because our estimates were just wrong or the distributions of keys is sufficiently skewed, we'll run HashIncreaseNumBatches, and that could happen more than once. I have a suite of contrived test queries that hits all the various modes and code paths of hash join, and it includes a query that plans for one batch but finishes up creating many, and then the leader exits. I'll post that to the other thread along with my latest patch series soon. >> Perhaps we can find a way to describe a variable number of BufFiles >> (ie batches) in a fixed space by making sure the filenames are >> constructed in a way that lets us just have to say how many there are. > > That could be done. Cool. >> Then the next problem is that for each BufFile we have to know how >> many 1GB segments there are to unlink (files named foo, foo.1, foo.2, >> ...), which Peter's code currently captures by publishing the file >> size in the descriptor... but if a fixed size object must describe N >> BufFiles, where can I put the size of each one? Maybe I could put it >> in a header of the file itself (yuck!), or maybe I could decide that I >> don't care what the size is, I'll simply unlink "foo", then "foo.1", >> then "foo.2", ... until I get ENOENT. > > There's nothing wrong with that algorithm as far as I'm concerned. Cool. -- Thomas Munro http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers