On Thu, Jan 5, 2017 at 1:15 PM, Tom Lane <t...@sss.pgh.pa.us> wrote: > Robert Haas <robertmh...@gmail.com> writes: >> No, I think backend-lifetime is right. The tranche registrations are >> all backend-local state, so there's no problem with backend A >> registering a string at one address and backend B registering a string >> at a different address. It's just important that neither of those >> strings goes away before the corresponding backend does. > > Then I don't understand what's going on. Isn't the crash you fixed > because backend A was looking at the tranche containing the lock backend B > was blocked on? How can a tranche name local to backend B work?
I think the chain of events must be as follows: 1. Backend A executed a parallel query at least once. During the course of doing so, it registered the parallel_query_dsa tranche, but after the execution of the parallel query completed, the tranche_name pointer was no longer valid, because it pointed into a DSM segment that got removed at the end of the query, rather than using backend-lifespan memory as advocated by the header comment of LWLockRegisterTranche. 2. Backend B began executing a parallel query and, somehow, ended up blocking on one of the DSA LWLocks. I'm not quite clear how this could've happened, because I didn't think we had any committed code that did DSA allocations yet, but maybe those locks are taken while attaching to the DSA or something like that. The details don't really matter: somehow, B managed to advertise LWTRANCHE_PARALLEL_QUERY_DSA in pg_stat_activity. 3. At exactly that moment, Backend A examined pg_stat_activity and saw that tranche ID and tried to look it up, following the dangling pointer left behind by step #1. Boom. If step #1 had not occurred, at the last step, backend A would have seen the tranche ID as unregistered and it would not have crashed. Instead, pg_stat_activity would have shown the wait event as "extension" rather than crashing. That's still wrong, of course, though not as bad as crashing. Basically, there are two problems here: - Every backend needs to register every built-in tranche ID at backend startup. If it doesn't, then it might read some other backend's wait event as "extension" instead of the correct value. The DSA code had the mistaken idea that tranche IDs only need to be registered before acquiring locks that use that tranche ID, which isn't true: we might need to interpret a tranche ID that some OTHER backend has advertised via pg_stat_activity. A corollary is that an extension that uses a tranche ID should register that tranche ID as early as possible (i.e. _PG_init()) so that a backend which has loaded but not used an extension can interpret that tranche ID if some other backend advertises it in pg_stat_activity. - Every backend needs to follow the directions and store the tranche name in memory that won't go away before the backend itself. If it doesn't, an attempt to interpret a tranche ID read from pg_stat_activity may follow a dangling pointer and crash, which is what must have been happening here. I suspect you're going to tell me this all needs to be better documented, which is probably a valid criticism. Suggestions as to where such documentation should be added - either as code comments or in a README somewhere or in doc/src/sgml - will be gratefully accepted. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers