On Wed, Jul 24, 2019 at 10:46:42AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby <[email protected]> wrote:
> > On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <[email protected]>
> > > wrote:
> > > > #2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at
> > > > elog.c:555
> > > > edata = <value optimized out>
> > >
> > > If you have that core, it might be interesting to go to frame 2 and
> > > print *edata or edata->saved_errno.
> >
> > As you saw.. unless you know a trick, it's "optimized out".
>
> How about something like this:
>
> print errorData[errordata_stack_depth]
Clever.
(gdb) p errordata[errordata_stack_depth]
$2 = {elevel = 13986192, output_to_server = 254, output_to_client = 127,
show_funcname = false, hide_stmt = false, hide_ctx = false, filename =
0x27b3790 "< %m %u >", lineno = 41745456,
funcname = 0x3030313335 <Address 0x3030313335 out of bounds>, domain = 0x0,
context_domain = 0x27cff90 "postgres", sqlerrcode = 0, message = 0xe8800000001
<Address 0xe8800000001 out of bounds>,
detail = 0x297a <Address 0x297a out of bounds>, detail_log = 0x0, hint =
0xe88 <Address 0xe88 out of bounds>, context = 0x297a <Address 0x297a out of
bounds>, message_id = 0x0, schema_name = 0x0,
table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name =
0x0, cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 0,
assoc_context = 0x0}
(gdb) p errordata
$3 = {{elevel = 22, output_to_server = true, output_to_client = false,
show_funcname = false, hide_stmt = false, hide_ctx = false, filename = 0x9c4030
"origin.c", lineno = 591,
funcname = 0x9c46e0 "CheckPointReplicationOrigin", domain = 0x9ac810
"postgres-11", context_domain = 0x9ac810 "postgres-11", sqlerrcode = 4293,
message = 0x27b0e40 "could not write to file
\"pg_logical/replorigin_checkpoint.tmp\": No space left on device", detail =
0x0, detail_log = 0x0, hint = 0x0, context = 0x0,
message_id = 0x8a7aa8 "could not write to file \"%s\": %m", ...
I ought to have remembered that it *was* in fact out of space this AM when
this core was dumped (I hadn't touched the machine since scheduling the
transition to this VM last week).
I'm almost certain it wasn't ENOSPC in the other cases, since, after
failing to find any log output, I ran df right after the failure.
But that gives me an idea: is it possible there's an issue with files being
held open by worker processes, including parallel workers? Probably WAL
segments, even after they're rotated? If worker processes were holding open
a lot of rotated WAL segments, that could cause ENOSPC, but it wouldn't be
obvious after they die, since the space would then be freed.
Justin