On 4/21/21 6:30 PM, Tom Lane wrote: > Thomas Munro <thomas.mu...@gmail.com> writes: >> Yeah, it would have been nice to include that but it'll have to be for >> v15 due to lack of time to convince myself that it was correct. I do >> intend to look into more concurrency of that kind for v15. I have >> pushed these patches, updated to be disabled by default. > > I have a fairly bad feeling about these patches. I've already fixed > one critical bug (see 9e4114822), but I am still seeing random, hard > to reproduce failures in WAL replay testing. It looks like sometimes > the "decoded" version of a WAL record doesn't match what I see in > the on-disk data, which I'm having no luck tracing down. > > Another interesting failure I just came across is > > 2021-04-21 11:32:14.280 EDT [14606] LOG: incorrect resource manager data > checksum in record at F/438000A4 > TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, > PID: 14606) > 2021-04-21 11:38:23.066 EDT [14603] LOG: startup process (PID 14606) was > terminated by signal 6: Abort trap > > with stack trace > > #0 0x90b669f0 in kill () > #1 0x90c01bfc in abort () > #2 0x0057a6a0 in ExceptionalCondition (conditionName=<value temporarily > unavailable, due to optimizations>, errorType=<value temporarily unavailable, > due to optimizations>, fileName=<value temporarily unavailable, due to > optimizations>, lineNumber=<value temporarily unavailable, due to > optimizations>) at assert.c:69 > #3 0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 > '\001') at xlogreader.c:845 > #4 0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, > errormsg=0xbfffba9c) at xlogreader.c:466 > #5 0x000f695c in XLogReadRecord (state=<value temporarily unavailable, due > to optimizations>, record=0xbfffba98, errormsg=<value temporarily > unavailable, due to optimizations>) at xlogreader.c:352 > #6 0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 > '\0') at xlog.c:4398 > #7 0x000ea320 in StartupXLOG () at xlog.c:7567 > #8 0x00362218 in StartupProcessMain () at startup.c:244 > #9 0x000fc170 in AuxiliaryProcessMain (argc=<value temporarily unavailable, > due to optimizations>, argv=<value temporarily unavailable, due to > optimizations>) at bootstrap.c:447 > #10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439 > #11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406 > #12 0x0029737c in main (argc=<value temporarily unavailable, due to > optimizations>, argv=<value temporarily unavailable, due to optimizations>) > at main.c:209 > > > I am not sure whether the checksum failure itself is real or a variant > of the seeming bad-reconstruction problem, but what I'm on about right > at this moment is that the error handling logic for this case seems > quite broken. Why is a checksum failure only worthy of a LOG message? > Why is ValidXLogRecord() issuing a log message for itself, rather than > being tied into the report_invalid_record() mechanism? Why are we > evidently still trying to decode records afterwards? >
Yeah, that seems suspicious. > In general, I'm not too pleased with the apparent attitude in this > thread that it's okay to push a patch that only mostly works on the > last day of the dev cycle and plan to stabilize it later. > Was there such attitude? I don't think people were arguing for pushing a patch's not working correctly. The discussion was mostly about getting it committed even and leaving some optimizations for v15. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company