Re: WIP: WAL prefetch (another approach)

Tomas Vondra Wed, 21 Apr 2021 13:08:23 -0700

On 4/21/21 6:30 PM, Tom Lane wrote:
> Thomas Munro <[email protected]> writes:
>> Yeah, it would have been nice to include that but it'll have to be for
>> v15 due to lack of time to convince myself that it was correct.  I do
>> intend to look into more concurrency of that kind for v15.  I have
>> pushed these patches, updated to be disabled by default.
> 
> I have a fairly bad feeling about these patches.  I've already fixed
> one critical bug (see 9e4114822), but I am still seeing random, hard
> to reproduce failures in WAL replay testing.  It looks like sometimes
> the "decoded" version of a WAL record doesn't match what I see in
> the on-disk data, which I'm having no luck tracing down.
> 
> Another interesting failure I just came across is
> 
> 2021-04-21 11:32:14.280 EDT [14606] LOG:  incorrect resource manager data 
> checksum in record at F/438000A4
> TRAP: FailedAssertion("state->decoding", File: "xlogreader.c", Line: 845, 
> PID: 14606)
> 2021-04-21 11:38:23.066 EDT [14603] LOG:  startup process (PID 14606) was 
> terminated by signal 6: Abort trap
> 
> with stack trace
> 
> #0  0x90b669f0 in kill ()
> #1  0x90c01bfc in abort ()
> #2  0x0057a6a0 in ExceptionalCondition (conditionName=<value temporarily 
> unavailable, due to optimizations>, errorType=<value temporarily unavailable, 
> due to optimizations>, fileName=<value temporarily unavailable, due to 
> optimizations>, lineNumber=<value temporarily unavailable, due to 
> optimizations>) at assert.c:69
> #3  0x000f5cf4 in XLogDecodeOneRecord (state=0x1000640, allow_oversized=1 
> '\001') at xlogreader.c:845
> #4  0x000f682c in XLogNextRecord (state=0x1000640, record=0xbfffba38, 
> errormsg=0xbfffba9c) at xlogreader.c:466
> #5  0x000f695c in XLogReadRecord (state=<value temporarily unavailable, due 
> to optimizations>, record=0xbfffba98, errormsg=<value temporarily 
> unavailable, due to optimizations>) at xlogreader.c:352
> #6  0x000e61a0 in ReadRecord (xlogreader=0x1000640, emode=15, fetching_ckpt=0 
> '\0') at xlog.c:4398
> #7  0x000ea320 in StartupXLOG () at xlog.c:7567
> #8  0x00362218 in StartupProcessMain () at startup.c:244
> #9  0x000fc170 in AuxiliaryProcessMain (argc=<value temporarily unavailable, 
> due to optimizations>, argv=<value temporarily unavailable, due to 
> optimizations>) at bootstrap.c:447
> #10 0x0035c740 in StartChildProcess (type=StartupProcess) at postmaster.c:5439
> #11 0x00360f4c in PostmasterMain (argc=5, argv=0xa006a0) at postmaster.c:1406
> #12 0x0029737c in main (argc=<value temporarily unavailable, due to 
> optimizations>, argv=<value temporarily unavailable, due to optimizations>) 
> at main.c:209
> 
> 
> I am not sure whether the checksum failure itself is real or a variant
> of the seeming bad-reconstruction problem, but what I'm on about right
> at this moment is that the error handling logic for this case seems
> quite broken.  Why is a checksum failure only worthy of a LOG message?
> Why is ValidXLogRecord() issuing a log message for itself, rather than
> being tied into the report_invalid_record() mechanism?  Why are we
> evidently still trying to decode records afterwards?
>


Yeah, that seems suspicious.

> In general, I'm not too pleased with the apparent attitude in this
> thread that it's okay to push a patch that only mostly works on the
> last day of the dev cycle and plan to stabilize it later.
> 

Was there such attitude? I don't think people were arguing for pushing a
patch's not working correctly. The discussion was mostly about getting
it committed even and leaving some optimizations for v15.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: WIP: WAL prefetch (another approach)

Reply via email to