On Wed, Jul 24, 2019 at 10:46:42AM +1200, Thomas Munro wrote:
> On Wed, Jul 24, 2019 at 10:42 AM Justin Pryzby <[email protected]> wrote:
> > On Wed, Jul 24, 2019 at 10:03:25AM +1200, Thomas Munro wrote:
> > > On Wed, Jul 24, 2019 at 5:42 AM Justin Pryzby <[email protected]>
> > > wrote:
> > > > #2 0x000000000085ddff in errfinish (dummy=<value optimized out>) at
> > > > elog.c:555
> > > > edata = <value optimized out>
> > >
> > > If you have that core, it might be interesting to go to frame 2 and
> > > print *edata or edata->saved_errno.
> >
> > As you saw.. unless you know a trick, it's "optimized out".
>
> How about something like this:
>
> print errorData[errordata_stack_depth]
Clever.
(gdb) p errordata[errordata_stack_depth]
$2 = {elevel = 13986192, output_to_server = 254, output_to_client = 127,
show_funcname = false, hide_stmt = false, hide_ctx = false, filename =
0x27b3790 "< %m %u >", lineno = 41745456,
funcname = 0x3030313335 <Address 0x3030313335 out of bounds>, domain = 0x0,
context_domain = 0x27cff90 "postgres", sqlerrcode = 0, message = 0xe8800000001
<Address 0xe8800000001 out of bounds>,
detail = 0x297a <Address 0x297a out of bounds>, detail_log = 0x0, hint =
0xe88 <Address 0xe88 out of bounds>, context = 0x297a <Address 0x297a out of
bounds>, message_id = 0x0, schema_name = 0x0,
table_name = 0x0, column_name = 0x0, datatype_name = 0x0, constraint_name =
0x0, cursorpos = 0, internalpos = 0, internalquery = 0x0, saved_errno = 0,
assoc_context = 0x0}
(gdb) p errordata
$3 = {{elevel = 22, output_to_server = true, output_to_client = false,
show_funcname = false, hide_stmt = false, hide_ctx = false, filename = 0x9c4030
"origin.c", lineno = 591,
funcname = 0x9c46e0 "CheckPointReplicationOrigin", domain = 0x9ac810
"postgres-11", context_domain = 0x9ac810 "postgres-11", sqlerrcode = 4293,
message = 0x27b0e40 "could not write to file
\"pg_logical/replorigin_checkpoint.tmp\": No space left on device", detail =
0x0, detail_log = 0x0, hint = 0x0, context = 0x0,
message_id = 0x8a7aa8 "could not write to file \"%s\": %m", ...
I ought to have remembered that it *was* in fact out of space this AM when
this core was dumped (I hadn't touched the machine since scheduling the
transition to this VM last week).
I'm almost certain it wasn't ENOSPC in the other cases, since, after
failing to find any log output, I ran df right after the failure.
But that gives me an idea: is it possible there's an issue with files being
held open by worker processes, including parallel workers? Probably WAL
segments, even after they're rotated? If worker processes were holding open
a lot of rotated WAL segments, that could cause ENOSPC, but it wouldn't be
obvious after they die, since the space would then be freed.
Justin