I believe that the WAL prefetch patch probably accounts for the
intermittent errors that buildfarm member topminnow has shown
since it went in, eg [1]:

diff -U3 
/home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out 
/home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out
--- /home/nm/ext4/HEAD/pgsql/contrib/pg_walinspect/expected/pg_walinspect.out   
2022-04-10 03:05:15.972622440 +0200
+++ 
/home/nm/ext4/HEAD/pgsql.build/contrib/pg_walinspect/results/pg_walinspect.out  
    2022-04-25 05:09:49.861642059 +0200
@@ -34,11 +34,7 @@
 (1 row)
 
 SELECT COUNT(*) >= 0 AS ok FROM 
pg_get_wal_records_info_till_end_of_wal(:'wal_lsn1');
- ok 
-----
- t
-(1 row)
-
+ERROR:  could not read WAL at 0/1903E40
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats(:'wal_lsn1', :'wal_lsn2');
  ok 
 ----
@@ -46,11 +42,7 @@
 (1 row)
 
 SELECT COUNT(*) >= 0 AS ok FROM pg_get_wal_stats_till_end_of_wal(:'wal_lsn1');
- ok 
-----
- t
-(1 row)
-
+ERROR:  could not read WAL at 0/1903E40
 -- ===================================================================
 -- Test for filtering out WAL records of a particular table
 -- ===================================================================


I've reproduced this manually on that machine, and confirmed that the
proximate cause is that XLogNextRecord() is returning NULL because
state->decode_queue_head == NULL, without bothering to provide an errormsg
(which doesn't seem very well thought out in itself).  I obtained the
contents of the xlogreader struct at failure:

(gdb) p *xlogreader
$1 = {routine = {page_read = 0x594270 <read_local_xlog_page_no_wait>, 
    segment_open = 0x593b44 <wal_segment_open>, 
    segment_close = 0x593d38 <wal_segment_close>}, system_identifier = 0, 
  private_data = 0x0, ReadRecPtr = 26230672, EndRecPtr = 26230752, 
  abortedRecPtr = 26230752, missingContrecPtr = 26230784, 
  overwrittenRecPtr = 0, DecodeRecPtr = 26230672, NextRecPtr = 26230752, 
  PrevRecPtr = 0, record = 0x0, decode_buffer = 0xf25428 "\240", 
  decode_buffer_size = 65536, free_decode_buffer = true, 
  decode_buffer_head = 0xf25428 "\240", decode_buffer_tail = 0xf25428 "\240", 
  decode_queue_head = 0x0, decode_queue_tail = 0x0, 
  readBuf = 0xf173f0 "\020\321\005", readLen = 0, segcxt = {
    ws_dir = '\000' <repeats 1023 times>, ws_segsize = 16777216}, seg = {
    ws_file = 25, ws_segno = 0, ws_tli = 1}, segoff = 0, 
  latestPagePtr = 26222592, latestPageTLI = 1, currRecPtr = 26230752, 
  currTLI = 1, currTLIValidUntil = 0, nextTLI = 0, 
  readRecordBuf = 0xf1b3f8 "<", readRecordBufSize = 40960, 
  errormsg_buf = 0xef3270 "", errormsg_deferred = false, nonblocking = false}

I don't have an intuition about where to look beyond that, any
suggestions?

What I do know so far is that while the failure reproduces fairly
reliably under "make check" (more than half the time, which squares
with topminnow's history), it doesn't reproduce at all under "make
installcheck" (after removing NO_INSTALLCHECK), which seems odd.
Maybe it's dependent on how much WAL history the installation has
accumulated?

It could be that this is a bug in pg_walinspect or a fault in its
test case; hard to tell since that got committed at about the same
time as the prefetch changes.

                        regards, tom lane

[1] 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=topminnow&dt=2022-04-25%2001%3A48%3A47


Reply via email to