On Wed, Jan 24, 2018 at 9:55 AM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> On Wed, Jan 24, 2018 at 3:45 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
>> On Wed, Jan 24, 2018 at 2:35 AM, Robert Haas <robertmh...@gmail.com> wrote:
>>> In the case of Gather, for example, we won't
>>> WaitForParallelWorkersToFinish() until we've read all the tuples. If
>>> there's a tuple queue that does not yet have a sender, then we'll just
>>> wait for it to get one, even if the sender fell victim to a fork
>>> failure and is never going to show up.
>>>
>>
>> Hmm, I think that case will be addressed because tuple queues can
>> detect if the leader is not attached. It does in code path
>> shm_mq_receive->shm_mq_counterparty_gone. In
>> shm_mq_counterparty_gone, it can detect if the worker is gone by using
>> GetBackgroundWorkerPid. Moreover, I have manually tested this
>> particular case before saying your patch is fine. Do you have some
>> other case in mind which I am missing?
>
> Hmm. Yeah. I can't seem to reach a stuck case and was probably just
> confused and managed to confuse Robert too. If you make
> fork_process() fail randomly (see attached), I see that there are a
> couple of easily reachable failure modes (example session at bottom of
> message):
>

In short, we are good with the committed code, right?

> 1. HandleParallelMessages() is reached and raises a "lost connection
> to parallel worker" error because shm_mq_receive() returns
> SHM_MQ_DETACHED, I think because shm_mq_counterparty_gone() checked
> GetBackgroundWorkerPid() just as you said. I guess that's happening
> because some other process is (coincidentally) sending
> PROCSIG_PARALLEL_MESSAGE at shutdown, causing us to notice that a
> process is unexpectedly stopped.
>
> 2. WaitForParallelWorkersToFinish() is reached and raises a "parallel
> worker failed to initialize" error. TupleQueueReaderNext() set done
> to true, because shm_mq_receive() returned SHM_MQ_DETACHED. Once
> again, that is because shm_mq_counterparty_gone() returned true. This
> is the bit Robert and I missed in our off-list discussion.
>
> As long as we always get our latch set by the postmaster after a fork
> failure (ie kill SIGUSR1) and after GetBackgroundWorkerPid() is
> guaranteed to return BGWH_STOPPED after that, and as long as we only
> ever use latch/CFI loops to wait, and as long as we try to read from a
> shm_mq, then I don't see a failure mode that hangs.
>
> This doesn't help a parallel.c client that doesn't try to read from a
> shm_mq though, like parallel CREATE INDEX (using the proposed design
> without dynamic Barrier).
>

Yes, this is what I am trying to explain on the parallel CREATE INDEX
thread. I think there we need to use either
WaitForParallelWorkersToFinish or WaitForParallelWorkersToAttach (a
new API proposed on that thread) if we don't want to use barriers. I
see a minor disadvantage in using WaitForParallelWorkersToFinish,
which I will explain on that thread.
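For anyone following along, the "latch/CFI loop that reads from a
shm_mq" pattern Thomas describes looks roughly like the sketch below.
To be clear, this is only an illustration, not code from the tree: the
function name wait_for_worker_message() is made up, and the real logic
lives in shm_mq.c, tqueue.c and parallel.c. The point is just that the
queue is read in non-blocking mode, so a fork failure surfaces as
SHM_MQ_DETACHED (via shm_mq_counterparty_gone() and
GetBackgroundWorkerPid()) instead of an indefinite sleep, and the
postmaster's SIGUSR1 sets our latch so the loop wakes up and retries.

#include "postgres.h"

#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
#include "storage/shm_mq.h"

/*
 * Illustrative sketch only (not from the PostgreSQL tree): wait for one
 * message from a worker, sleeping only in WaitLatch() with
 * CHECK_FOR_INTERRUPTS(), and reading the queue with nowait = true so
 * that SHM_MQ_DETACHED ends the wait instead of blocking forever on a
 * worker that was never forked.
 */
static void
wait_for_worker_message(shm_mq_handle *mqh)
{
	for (;;)
	{
		shm_mq_result res;
		Size		nbytes;
		void	   *data;

		/* Try to read without blocking. */
		res = shm_mq_receive(mqh, &nbytes, &data, true);

		if (res == SHM_MQ_SUCCESS)
			break;				/* got a message; process "data" here */

		if (res == SHM_MQ_DETACHED)
			ereport(ERROR,
					(errcode(ERRCODE_INTERNAL_ERROR),
					 errmsg("lost connection to parallel worker")));

		/*
		 * SHM_MQ_WOULD_BLOCK: sleep until our latch is set.  The worker
		 * sets it when it writes data, and the postmaster sets it (via
		 * SIGUSR1) if the fork fails, so we always wake up and retry.
		 */
		(void) WaitLatch(MyLatch, WL_LATCH_SET, 0, WAIT_EVENT_MQ_RECEIVE);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}

Parallel CREATE INDEX never reads from such a queue, which is why it
needs WaitForParallelWorkersToAttach/WaitForParallelWorkersToFinish
(or a barrier) instead, as discussed above.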
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com