On 5/3/21 7:42 AM, Thomas Munro wrote:
> On Sun, May 2, 2021 at 3:16 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
>> That last point means that there was some hard-to-hit problem even
>> before any of the recent WAL-related changes.  However, 323cbe7c7
>> (Remove read_page callback from XLogReader) increased the failure
>> rate by at least a factor of 5, and 1d257577e (Optionally prefetch
>> referenced data) seems to have increased it by another factor of 4.
>> But it looks like f003d9f87 (Add circular WAL decoding buffer)
>> didn't materially change the failure rate.
>
> Oh, wow.  There are several surprising results there.  Thanks for
> running those tests for so long so that we could see the rarest
> failures.
>
> Even if there are somehow *two* causes of corruption, one preexisting
> and one added by the refactoring or decoding patches, I'm struggling
> to understand how the chance increases with 1d2575, since that only
> adds code that isn't reached when not enabled (though I'm going to
> re-review that).
>
>> Considering that 323cbe7c7 was supposed to be just refactoring,
>> and 1d257577e is allegedly disabled-by-default, these are surely
>> not the results I was expecting to get.
>
> +1
>
>> It seems like it's still an open question whether all this is
>> a real bug, or flaky hardware.  I have seen occasional kernel
>> freezeups (or so I think -- machine stops responding to keyboard
>> or network input) over the past year or two, so I cannot in good
>> conscience rule out the flaky-hardware theory.  But it doesn't
>> smell like that kind of problem to me.  I think what we're looking
>> at is a timing-sensitive bug that was there before (maybe long
>> before?) and these commits happened to make it occur more often
>> on this particular hardware.  This hardware is enough unlike
>> anything made in the past decade that it's not hard to credit
>> that it'd show a timing problem that nobody else can reproduce.
>
> Hmm, yeah that does seem plausible.  It would be nice to see a report
> from any other system though.  I'm still trying, and reviewing...


FWIW I've run the test (make installcheck-parallel in a loop) on four different machines - two x86_64 ones, and two rpi4. The x86 boxes did ~1000 rounds each (and one of them had 5 local replicas) without any issue. The rpi4 machines did ~50 rounds each, also without failures.
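
For anyone who wants to repeat this kind of run, a loop of that sort could be sketched roughly as below. This is only an illustration, not the script actually used for the numbers above: the helper name run_until_failure and its parameters are hypothetical, and it assumes the command is started from a configured PostgreSQL source tree with a server already running.

```python
import subprocess


def run_until_failure(cmd, max_rounds, cwd=None):
    """Run cmd up to max_rounds times.

    Returns the 1-based number of the first round whose exit status
    was non-zero, or None if every round succeeded.
    (Hypothetical helper, named here only for illustration.)
    """
    for round_no in range(1, max_rounds + 1):
        # Output of the command goes straight to the terminal.
        result = subprocess.run(cmd, cwd=cwd)
        if result.returncode != 0:
            # Stop right away so the failing state can be inspected.
            return round_no
    return None
```

Calling run_until_failure(["make", "installcheck-parallel"], max_rounds=1000) would then correspond to roughly one of the x86 runs; stopping at the first failing round keeps the data directory around for inspection.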

Obviously, it's possible there's something that neither of those (very different) systems triggers, but I'd say it might also be a hint that this really is a hw issue on the old ppc macs. Or maybe something very specific to that arch.


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

