On Thu, Aug 15, 2019 at 5:49 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
> So that leads to the thought that "the infinite_recurse test is fine
> if it runs by itself, but it tends to fall over if there are
> concurrently-running backends".  I have absolutely no idea how that
> would happen on anything that passes for a platform built in this
> century.  Still, it's a place to start, which we hadn't before.

Hmm.  mereswin's recent failure on REL_11_STABLE was running the
serial schedule.

I read about 3 ways to get SEGV from stack-related faults: you can
exceed RLIMIT_STACK (the total mapping size) and then you'll get SEGV
(per man pages), you can access a page that is inside the mapping but
is beyond the stack pointer (with some tolerance, exact details vary
by arch), and you can fail to allocate a page due to low memory.

The first kind of failure doesn't seem right -- we carefully set
max_stack_size based on RLIMIT_STACK minus some slop, so that theory
would require child processes to have different stack limits than the
postmaster as you said (perhaps OpenStack, Docker, related tooling or
concurrent activity on the host system is capable of changing it?), or
a bug in our slop logic.  The second kind of failure would imply that
we have a bug -- we're accessing something below the stack pointer --
but that doesn't seem right either -- I think various address
sanitising tools would have told us about that, and it's hard to
believe there is a bug in the powerpc and arm implementation of the
stack pointer check (see Linux arch/{powerpc,arm}/mm/fault.c).  That
leaves the third explanation, except then I'd expect to see other
kinds of problems like OOM etc before you get to that stage, and why
just here?  Confused.

> Also notable is that we now have a couple of hits on ARM, not
> only ppc64.  Don't know what to make of that.

Yeah, that is indeed interesting.

Thomas Munro

Reply via email to