Hi, On 2020-08-18 01:21:17 -0400, Tom Lane wrote: > Andres Freund <[email protected]> writes: > > I'd written to Tom that I was planning to revert unless the number of > > failures were lower than initially indicated. But that actually seems to > > have come to pass (the failures are quicker to report because they don't > > run the subsequent tests, of course). I'd like to let the failures > > accumulate a bit longer, say until tomorrow Midday if I haven't figured > > it out by then. With the hope of finding some detail to help pinpoint > > the issue. > > There's certainly no obvious pattern here, so I agree with waiting for > more data.
FWIW, I think I have found the bug, but I'm still working to reproduce the issue reliably enough that I can verify that the fix actually works. The issue is basically that 2PC PREPARE is weird, WRT procarray. The last snapshot built with GetSnapshotData() before the PREPARE doesn't include its own transaction in ->xip[], as normal. PrepareTransaction() removes the "normal" entry with ProcArrayClearTransaction(), which so far doesn't increase the xact completion count. Because the xact completion count is not increased, snapshots can be reused as long as they're taken before the 2PC transaction is finished. That's fine for other backends, but for the backend doing the PrepareTransaction() it's not, because there ->xip doesn't include the own backend. It's a bit tricky to reproduce exactly the issue the BF is occasionally hitting, because the way ->xmax is computed *limits* the damage. Combined with the use of SERIALIZABLE (preventing recomputation of the data snapshot) that makes it somewhat hard to hit. Greetings, Andres Freund
