On Tue, Jul 23, 2019 at 01:28:47PM -0400, Tom Lane wrote:
> ... you'd think an OOM kill would show up in the kernel log.
> (Not necessarily in dmesg, though. Did you try syslog?)
Nothing in /var/log/messages (nor dmesg ring).
I enabled abrtd last week while trying to reproduce it. Since you asked, I
looked in messages again and found that it had logged this 10 hours ago:
(gdb) bt
#0 0x000000395be32495 in raise () from /lib64/libc.so.6
#1 0x000000395be33c75 in abort () from /lib64/libc.so.6
#2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
#3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
#4 0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
#5 0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
#6 0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
#7 0x000000000050fe75 in AuxiliaryProcessMain (argc=2, argv=0x7ffe00d56b00) at bootstrap.c:451
#8 0x00000000006dcf54 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5337
#9 0x00000000006de78a in reaper (postgres_signal_arg=<value optimized out>) at postmaster.c:2867
#10 <signal handler called>
#11 0x000000395bee1603 in __select_nocancel () from /lib64/libc.so.6
#12 0x00000000006e1488 in ServerLoop (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1671
#13 PostmasterMain (argc=<value optimized out>, argv=<value optimized out>) at postmaster.c:1380
#14 0x0000000000656420 in main (argc=3, argv=0x27ae410) at main.c:228
#2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at elog.c:555
edata = <value optimized out>
elevel = 22
oldcontext = 0x27e15d0
econtext = 0x0
__func__ = "errfinish"
#3 0x00000000006f7e94 in CheckPointReplicationOrigin () at origin.c:588
save_errno = <value optimized out>
tmppath = 0x9c4518 "pg_logical/replorigin_checkpoint.tmp"
path = 0x9c4300 "pg_logical/replorigin_checkpoint"
tmpfd = 64
i = <value optimized out>
magic = 307747550
crc = 4294967295
__func__ = "CheckPointReplicationOrigin"
#4 0x00000000004f6ef1 in CheckPointGuts (checkPointRedo=5507491783792, flags=128) at xlog.c:9150
No locals.
#5 0x00000000004feff6 in CreateCheckPoint (flags=128) at xlog.c:8937
shutdown = false
checkPoint = {redo = 5507491783792, ThisTimeLineID = 1, PrevTimeLineID
= 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2141308, nextOid =
496731439, nextMulti = 1, nextMultiOffset = 0,
oldestXid = 561, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1,
time = 1563781930, oldestCommitTsXid = 0, newestCommitTsXid = 0,
oldestActiveXid = 2141308}
recptr = <value optimized out>
_logSegNo = <value optimized out>
Insert = <value optimized out>
freespace = <value optimized out>
PriorRedoPtr = <value optimized out>
curInsert = <value optimized out>
last_important_lsn = <value optimized out>
vxids = 0x280afb8
nvxids = 0
__func__ = "CreateCheckPoint"
#6 0x00000000006d49e2 in CheckpointerMain () at checkpointer.c:491
ckpt_performed = false
do_restartpoint = <value optimized out>
flags = 128
do_checkpoint = <value optimized out>
now = 1563781930
elapsed_secs = <value optimized out>
cur_timeout = <value optimized out>
rc = <value optimized out>
local_sigjmp_buf = {{__jmpbuf = {2, -1669940128760174522, 9083146, 0,
140728912407216, 140728912407224, -1669940128812603322, 1670605924426606662},
__mask_was_saved = 1, __saved_mask = {__val = {
18446744066192964103, 0, 246358747096, 140728912407296,
140446084917816, 140446078556040, 9083146, 0, 246346239061, 140446078556040,
140447207471460, 0, 140447207471424, 140446084917816, 0,
7864320}}}}
checkpointer_context = 0x27e15d0
__func__ = "CheckpointerMain"
Supposedly it's trying to do this:
| ereport(PANIC,
| (errcode_for_file_access(),
| errmsg("could not write to file \"%s\": %m",
| tmppath)));
And since there's consistently nothing in the logs, I'm guessing there's a
legitimate write error (legitimate from PG's perspective). Storage here is ext4
plus a ZFS tablespace, on top of LVM, on top of a VMware thin volume.
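
To illustrate the kind of check that ends in that message, here is a minimal
standalone sketch. It is not the origin.c code: write_or_errno and
format_write_error are hypothetical names, and treating a short write with no
errno as ENOSPC is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): try to write exactly len bytes
 * to fd in a single call.  Returns 0 on success, otherwise an errno value.
 */
int
write_or_errno(int fd, const void *buf, size_t len)
{
	ssize_t		written;

	assert(buf != NULL);

	errno = 0;
	written = write(fd, buf, len);
	if (written >= 0 && (size_t) written == len)
		return 0;

	/*
	 * A short write does not set errno.  Assumption for this sketch: report
	 * that case as "no space left on device".
	 */
	if (errno == 0)
		errno = ENOSPC;
	return errno;
}

/*
 * Hypothetical helper: build the message text, with strerror(err) standing
 * in for what %m expands to in the ereport() call quoted above.
 */
void
format_write_error(char *dst, size_t dstlen, const char *path, int err)
{
	snprintf(dst, dstlen, "could not write to file \"%s\": %s",
			 path, strerror(err));
}
```

With the locals from frame #3, the call would look like
write_or_errno(tmpfd, &magic, sizeof(magic)), and a nonzero return is the
point at which the PANIC would be raised. The errno value is the interesting
part, since it is what would distinguish a full device from an I/O error.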
Justin