On Thu, Aug 15, 2019 at 5:49 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> So that leads to the thought that "the infinite_recurse test is fine
> if it runs by itself, but it tends to fall over if there are
> concurrently-running backends". I have absolutely no idea how that
> would happen on anything that passes for a platform built in this
> century. Still, it's a place to start, which we hadn't before.

Hmm. mereswin's recent failure on REL_11_STABLE was running the
serial schedule.

I read about 3 ways to get SEGV from stack-related faults: you can
exceed RLIMIT_STACK (the total mapping size) and then you'll get SEGV
(per man pages), you can access a page that is inside the mapping but
is beyond the stack pointer (with some tolerance, exact details vary
by arch), and you can fail to allocate a page due to low memory.

The first kind of failure doesn't seem right -- we carefully set
max_stack_size based on RLIMIT_STACK minus some slop, so that theory
would require child processes to have different stack limits than the
postmaster as you said (perhaps OpenStack, Docker, related tooling or
concurrent activity on the host system is capable of changing it?),
or a bug in our slop logic.

The second kind of failure would imply that we have a bug -- we're
accessing something below the stack pointer -- but that doesn't seem
right either -- I think various address sanitising tools would have
told us about that, and it's hard to believe there is a bug in the
powerpc and arm implementation of the stack pointer check (see Linux
arch/{powerpc,arm}/mm/fault.c).

That leaves the third explanation, except then I'd expect to see
other kinds of problems like OOM etc before you get to that stage,
and why just here? Confused.

> Also notable is that we now have a couple of hits on ARM, not
> only ppc64. Don't know what to make of that.

Yeah, that is indeed interesting.

-- 
Thomas Munro
https://enterprisedb.com
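
As a rough illustration of the first case (a limit derived from RLIMIT_STACK minus some slop), here is a minimal Python sketch. It is not the PostgreSQL logic: the slop value and the helper names safe_stack_budget and current_stack_budget are made up for the example. Running something like it in both a parent and a child process would be one way to see whether they really observe the same limit.

```python
import resource

# Hypothetical slop value, chosen only for illustration;
# it is not taken from the PostgreSQL source.
STACK_SLOP_BYTES = 512 * 1024


def safe_stack_budget(soft_limit, slop=STACK_SLOP_BYTES):
    """Return soft_limit minus slop in bytes, or None when the limit
    is unlimited or too small to leave any room after the slop."""
    if soft_limit == resource.RLIM_INFINITY:
        return None
    budget = soft_limit - slop
    return budget if budget > 0 else None


def current_stack_budget():
    """Compute the budget from this process's own soft RLIMIT_STACK."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    return safe_stack_budget(soft)


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    print("RLIMIT_STACK soft:", soft, "hard:", hard)
    print("budget after slop:", current_stack_budget())
```

This only covers the RLIMIT_STACK theory; it says nothing about the stack pointer check or about failing to allocate a page under memory pressure.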