On Sat, Aug 27, 2016 at 3:43 AM, Amit Kapila <amit.kapil...@gmail.com> wrote: > On Fri, Aug 26, 2016 at 6:20 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: >> Latest from lorikeet: >> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lorikeet&dt=2016-08-26%2008%3A37%3A27 >> >> TRAP: FailedAssertion("!(vmq->mq_sender == ((void *)0))", File: >> "/home/andrew/bf64/root/REL9_6_STABLE/pgsql.build/../pgsql/src/backend/storage/ipc/shm_mq.c", >> Line: 220) >> > > Do you think, it is due to some recent change or we are just seeing > now as it could be timing specific issue? > > So here what seems to be happening is that during worker startup, we > are trying to set the sender for a shared memory queue and the same is > already set. Now, one theoretical possibility of the same could be > that the two workers get the same ParallelWorkerNumber which is then > used to access the shm queue (refer > ParallelWorkerMain/ExecParallelGetReceiver). We are setting the > ParallelWorkerNumber in below code which seems to be doing what it is > suppose to do: > > LaunchParallelWorkers() > { > .. > for (i = 0; i < pcxt->nworkers; ++i) > { > memcpy(worker.bgw_extra, &i, sizeof(int)); > if (!any_registrations_failed && > RegisterDynamicBackgroundWorker(&worker, > &pcxt->worker[i].bgwhandle)) > .. > } > > Can some reordering impact the above code?
I don't think so. Your guess that ParallelWorkerNumber is getting messed up somehow seems like a good one, but I don't see anything wrong with that code. There's actually a pretty long chain here. That code copies the value of the local variable i into worker.bgw_extra. Then, RegisterDynamicBackgroundWorker copies the whole structure into shared memory. Then, running inside the postmaster, BackgroundWorkerStateChange copies it into the postmaster address space. But, since this is Windows, that copy doesn't actually passed to the worker; instead, BackgroundWorkerEntry() copies the data from shared memory into the new worker processes' MyBgworkerEntry. Then BackgroundWorkerMain() copies the data from there to ParallelWorkerNumber. In theory any of those places could be going wrong somehow, though none of them can be completely busted because they all work at least most of the time. Of course, it's also possible that the ParallelWorkerNumber code is entirely correct and something overwrote the null bytes that were supposed to be found at that location. It would be very useful to see (a) the value of ParallelWorkerNumber and (b) the contents of vmq->mq_sender, and in particular whether that's actually a valid pointer to a PGPROC in the ProcArray. But unless we can reproduce this I don't see how to manage that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers