On Wed, Apr 18, 2018 at 8:52 AM, Jonathan Rudenberg
<jonat...@titanous.com> wrote:
> Hundreds of queries stuck with a wait_event of DynamicSharedMemoryControlLock
> and pg_terminate_backend did not terminate the queries.
>
> In the log:
>
>> FATAL: cannot unpin a segment that is not pinned

Thanks for the report. That error is reachable via two paths:

1. Cleanup of a DSA area at the end of a query, giving back all
segments. This is how the bug originally reported in this thread
reached it, and that's because of a case where we tried to
double-destroy the DSA area when the refcount went down to zero, then
back up again, and then back to zero (a late-starting parallel worker
that attached in a narrow time window). That was fixed in fddf45b3:
once the refcount reaches zero we recognise the area as already
destroyed and don't even let anyone attach.

2. In destroy_superblock(), called by dsa_free(), when we've
determined that a 64kB superblock can be given back to the DSM
segment, and that the DSM segment is now entirely free so can be given
back to the operating system. To do that, after we put the pages back
into the free page manager we test fpm_largest(segment_map->fpm) ==
segment_map->header->usable_pages to see if the largest span of free
pages is now the same size as the whole segment.

I don't have any theories about how that could be going wrong right
now, but I'm looking into it. There could be a logic bug in dsa.c, or
a logic bug in client code running an invalid sequence of
dsa_allocate(), dsa_free() calls that corrupts state (I wonder if a
well-timed double dsa_free() could produce this effect), or a
common-or-garden overrun bug somewhere that trashes control state.

> I don't have a backtrace yet, but I will provide them if/when the issue
> happens again.

Thanks, that would be much appreciated, as would any clues about what
workload you're running. Do you know what the query plan looks like
for the queries that crashed?

-- 
Thomas Munro
http://www.enterprisedb.com
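
For readers following along, the "is the segment now entirely free?"
test from path 2 can be illustrated with a toy model. This is a
minimal sketch, not the dsa.c implementation: toy_segment, toy_init(),
toy_set_pages(), toy_largest_free_run() and
toy_segment_entirely_free() are hypothetical names made up for
illustration, a plain bool array stands in for the free page manager,
and the 16-page segment size is an arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGES 16			/* hypothetical segment size, in pages */

/* Toy stand-in for a DSM segment; not the real dsa.c structures. */
typedef struct toy_segment
{
	size_t		usable_pages;	/* pages available for allocation */
	bool		page_free[TOY_PAGES];	/* true if the page is free */
} toy_segment;

/* Initialise a segment with every page free. */
static void
toy_init(toy_segment *seg)
{
	seg->usable_pages = TOY_PAGES;
	for (size_t i = 0; i < TOY_PAGES; i++)
		seg->page_free[i] = true;
}

/* Mark npages pages starting at 'first' as free or allocated. */
static void
toy_set_pages(toy_segment *seg, size_t first, size_t npages, bool is_free)
{
	assert(first + npages <= seg->usable_pages);
	for (size_t i = first; i < first + npages; i++)
		seg->page_free[i] = is_free;
}

/*
 * Length of the longest contiguous run of free pages.  In this sketch it
 * plays the role that fpm_largest() plays in the test described above.
 */
static size_t
toy_largest_free_run(const toy_segment *seg)
{
	size_t		best = 0;
	size_t		run = 0;

	for (size_t i = 0; i < seg->usable_pages; i++)
	{
		if (seg->page_free[i])
		{
			run++;
			if (run > best)
				best = run;
		}
		else
			run = 0;
	}
	return best;
}

/* The segment is entirely free iff one free run covers all usable pages. */
static bool
toy_segment_entirely_free(const toy_segment *seg)
{
	return toy_largest_free_run(seg) == seg->usable_pages;
}
```

In this model, allocating two separate 4-page spans and freeing only
one of them leaves toy_segment_entirely_free() false; it becomes true
only once the last span is returned. The sketch says nothing about how
the real check could misfire, which is the open question above.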