On Thu, Feb 7, 2019 at 12:47 PM Justin Pryzby <pry...@telsasoft.com> wrote:
> However I *did* reproduce the error in an isolated, non-production postgres
> instance.  It's a total empty, untuned v11.1 initdb just for this, running ONLY
> a few simultaneous loops around just one query.  It looks like the simultaneous
> loops sometimes (but not always) fail together.  This has happened a couple
> times.
>
> It looks like one query failed due to "could not attach" in leader, one failed
> due to same in worker, and one failed with "not pinned", which I hadn't seen
> before and appears to be related to DSM, not DSA...

Hmm.  I hadn't considered that angle...  Some kind of interference
between unrelated DSA areas, or other DSM activity?  I will also try
to repro that here...

> I'm also trying to reproduce on other production servers.  But so far nothing
> else has shown the bug, including the other server which hit our original
> (other) DSA error with the queued_alters query.  So I tentatively think there
> really may be something specific to the server (not the hypervisor so maybe the
> OS, libraries, kernel, scheduler, ??).

Initially I thought these might be two symptoms of the same
corruption, but I'm now starting to wonder if there are two bugs here:
"could not allocate %d pages" (rare) might be a logic bug in the
computation of contiguous_pages that requires a particular allocation
pattern to hit, and "dsa_area could not attach to segment" (rarissimo)
might be something else requiring concurrency/a race.

One thing that might be useful would be to add a call to
dsa_dump(area) just before the errors are raised, which will write a
bunch of stuff out to stderr and might give us some clues.  It would
also help to print out the variable "index" from
get_segment_by_index() when it fails.  I'm also going to try to work
up some better assertions.

-- 
Thomas Munro
http://www.enterprisedb.com
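
To make the instrumentation idea concrete, here is a rough, standalone
sketch of the pattern: dump the area's state and report the failing
index just before giving up.  This is a toy model, not PostgreSQL code
and not a patch against dsa.c.  ToyArea, ToySegment, toy_dump() and
toy_get_segment_by_index() are all hypothetical names that only stand
in for dsa_area, dsa_dump() and get_segment_by_index(), and the real
functions report failure through elog() rather than by returning NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define TOY_MAX_SEGMENTS 4

/* Hypothetical stand-in for one segment slot in an area. */
typedef struct ToySegment
{
    int     handle;      /* 0 means "no segment exists at this index" */
    int     attached;    /* 1 once this process has mapped the segment */
    size_t  free_pages;  /* toy bookkeeping, only shown in the dump */
} ToySegment;

/* Hypothetical stand-in for an area: a fixed table of segment slots. */
typedef struct ToyArea
{
    ToySegment segments[TOY_MAX_SEGMENTS];
} ToyArea;

/* Toy analogue of a dump routine: write every slot to stderr. */
void
toy_dump(const ToyArea *area)
{
    size_t  i;

    assert(area != NULL);
    for (i = 0; i < TOY_MAX_SEGMENTS; i++)
        fprintf(stderr, "segment %zu: handle=%d attached=%d free_pages=%zu\n",
                i,
                area->segments[i].handle,
                area->segments[i].attached,
                area->segments[i].free_pages);
}

/*
 * Toy analogue of a segment lookup.  On failure, dump the whole area
 * and print the offending index before returning NULL, so the log
 * shows the state that led to the error.
 */
ToySegment *
toy_get_segment_by_index(ToyArea *area, size_t index)
{
    assert(area != NULL);

    if (index >= TOY_MAX_SEGMENTS || area->segments[index].handle == 0)
    {
        toy_dump(area);
        fprintf(stderr, "toy area could not attach to segment (index = %zu)\n",
                index);
        return NULL;
    }

    area->segments[index].attached = 1;
    return &area->segments[index];
}
```

Looking up a slot whose handle is set returns that slot and marks it
attached.  Looking up an empty or out-of-range index writes the dump and
the index to stderr first, which is the extra context that would be
useful to have next to the real error messages.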