Re: stress test for parallel workers

Tom Lane Mon, 14 Oct 2019 17:03:52 -0700

I wrote:
> Filed at
> https://bugzilla.kernel.org/show_bug.cgi?id=205183
> We'll see what happens ...


Further to this --- I went back and looked at the outlier events
where we saw an infinite_recurse failure on a non-Linux-PPC64
platform.  There were only three:

 mereswine    | ARMv7            | Linux debian-armhf | Clarence Ho     | REL_11_STABLE | 2019-08-11 02:10:12 | InstallCheck-C  | 2019-08-11 02:36:10.159 PDT [5004:4] DETAIL:  Failed process was running: select infinite_recurse();
 mereswine    | ARMv7            | Linux debian-armhf | Clarence Ho     | REL_12_STABLE | 2019-08-11 09:52:46 | pg_upgradeCheck | 2019-08-11 04:21:16.756 PDT [6804:5] DETAIL:  Failed process was running: select infinite_recurse();
 mereswine    | ARMv7            | Linux debian-armhf | Clarence Ho     | HEAD          | 2019-08-11 11:29:27 | pg_upgradeCheck | 2019-08-11 07:15:28.454 PDT [9954:76] DETAIL:  Failed process was running: select infinite_recurse();
 
Looking closer at these, though, they were *not* SIGSEGV failures,
but SIGKILLs.  Seeing that they were all on the same machine on the
same day, I'm thinking we can write them off as a transiently
misconfigured OOM killer.
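
For anyone wanting to check the SIGSEGV-versus-SIGKILL distinction on their own machine, here is a minimal sketch, assuming a POSIX system and Python 3. The helper names describe_exit and run_and_signal_self are hypothetical and have nothing to do with the buildfarm tooling; the sketch only relies on the fact that subprocess reports a child killed by signal N as return code -N.

```python
import signal
import subprocess
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a readable description.

    On POSIX, subprocess reports "killed by signal N" as -N.
    """
    if returncode < 0:
        return f"terminated by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_signal_self(sig: signal.Signals) -> int:
    """Start a child Python process that sends `sig` to itself.

    Hypothetical helper, used only to produce a signal death to inspect.
    """
    child_code = f"import os; os.kill(os.getpid(), {int(sig)})"
    return subprocess.run([sys.executable, "-c", child_code]).returncode


if __name__ == "__main__":
    # A crash (SIGSEGV) and an external kill (SIGKILL, which is what the
    # OOM killer sends) are distinguishable from the exit status alone.
    for sig in (signal.SIGSEGV, signal.SIGKILL):
        print(describe_exit(run_and_signal_self(sig)))
```

The exit status alone does not say who sent the SIGKILL; attributing it to the OOM killer still needs the kernel log or, as here, circumstantial evidence.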

So, pending some other theory emerging from the kernel hackers, we're
down to it's-a-PPC64-kernel-bug.  That leaves me wondering what if
anything we want to do about it.  Even if it's fixed reasonably promptly
in Linux upstream, and then we successfully nag assorted vendors to
incorporate the fix quickly, that's still going to leave us with frequent
buildfarm failures on Mark's flotilla of not-the-very-shiniest Linux
versions.

Should we move the infinite_recurse test to run alone in its own parallel
group, just to stop these failures?  That's annoying from a parallelism
standpoint, but I don't see any other way to avoid them.

                        regards, tom lane
