Hi,

On 10/15/2017 12:42 AM, Justin Pryzby wrote:
> On Fri, Oct 13, 2017 at 10:57:32PM -0500, Justin Pryzby wrote:
>> I don't have any reason to believe there's a memory issue on the server, so I
>> suppose this is just a "heads up" to early adopters until/in case it happens
>> again and I can at least provide a stack trace.
> 
> I'm back; find stacktrace below.
> 
>> Today I see:
>> < 2017-10-13 17:22:47.839 -04  >LOG:  server process (PID 32127) was 
>> terminated by signal 11: Segmentation fault
>> < 2017-10-13 17:22:47.839 -04  >DETAIL:  Failed process was running: 
>> autovacuum: BRIN summarize public.gtt 747263
> 
> Is it a coincidence that the server failed within 45m of yesterday's failure?
> 

Most likely just a coincidence.

> postmaster[26500] general protection ip:84a177 sp:7ffd9b349b88 error:0 in 
> postgres[400000+692000]
> < 2017-10-14 18:05:36.432 -04  >DETAIL:  Failed process was running: 
> autovacuum: BRIN summarize public.gtt 41087
> 
>> It looks like this table was being inserted into simultaneously by a python
>> program using multiprocessing.  It looks like each subprocess was INSERTing
>> into several tables, each of which has one BRIN index on a timestamp column.
> 
> I should add:
> These are insert-only child tables in a hierarchy (not PG10 partitions), being
> inserted into directly (not via trigger/rule).
> 
> Also notice the vacuum process was interrupted, same as yesterday (thank
> goodness for full logs).  Our INSERT script is using python
> multiprocessing.pool() with "maxtasksperchild=1", which I think means we load
> one file and then exit the subprocess, and pool() creates a new subproc, which
> starts a new PG session and transaction.  Which explains why autovacuum starts
> processing the table only to be immediately interrupted.
> 

I don't follow. Why would that explain autovacuum getting canceled? I
mean, merely opening a new connection/session should not cancel
autovacuum. That only happens when a command requests a table-level lock
conflicting with the one autovacuum holds (so e.g. an explicit LOCK
command, DDL, ...).
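
FWIW the cancel does not come from the new session itself, it comes from
the lock manager: once some backend has been stuck for deadlock_timeout
waiting on a lock held by an autovacuum worker, it sends the worker
SIGINT. Roughly this bit of ProcSleep() in proc.c (abridged from memory,
so the details may be slightly off):

    /* in the backend waiting for the lock, after deadlock_timeout expires */
    if (deadlock_state == DS_BLOCKED_BY_AUTOVACUUM && allow_autovacuum_cancel)
    {
        PGPROC     *autovac = GetBlockingAutoVacuumPgproc();
        PGXACT     *autovac_pgxact = &ProcGlobal->allPgXact[autovac->pgprocno];

        LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);

        /* never cancel a worker doing anti-wraparound work */
        if ((autovac_pgxact->vacuumFlags & PROC_IS_AUTOVACUUM) &&
            !(autovac_pgxact->vacuumFlags & PROC_VACUUM_FOR_WRAPAROUND))
        {
            int         pid = autovac->pid;

            LWLockRelease(ProcArrayLock);

            /* the SIGINT makes the worker bail out with
             * "canceling autovacuum task" */
            if (kill(pid, SIGINT) < 0 && errno != ESRCH)
                elog(WARNING, "could not send signal to process %d: %m", pid);
        }
        else
            LWLockRelease(ProcArrayLock);

        /* only try this once per lock wait */
        allow_autovacuum_cancel = false;
    }

And a plain INSERT only takes RowExclusiveLock, which does not conflict
with the ShareUpdateExclusiveLock the worker should be holding, so the
inserts alone can't be what cancels it.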

> ... 
> Due to a "behavioral deficiency" in the loader for those tables, the crashed
> backend causes the loader to get stuck, so the tables should be untouched 
> since
> the crash, should it be desirable to inspect them.
> 

It's a bit difficult to guess what went wrong from this backtrace. For me,
gdb typically prints a bunch of lines immediately before the frames,
explaining what went wrong - not sure why that's missing here.

Perhaps some of those pointers are bogus - the memory was already pfree'd
or something like that. You'll have to poke around and try dereferencing
the pointers to find what works and what does not.

For example, what do these gdb commands print in the #0 frame?

(gdb) p *(MemoryContext)context
(gdb) p *GetMemoryChunkContext(pointer)
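
The reason I'm asking is that pfree() reads the owning context from a
pointer stored immediately in front of the chunk, so a garbage-looking
"context" in frame #0 generally means those few bytes got overwritten.
Roughly (paraphrasing mcxt.c and memutils.h):

    void
    pfree(void *pointer)
    {
        MemoryContext context = GetMemoryChunkContext(pointer);

        (*context->methods->free_p) (context, pointer);
    }

    static inline MemoryContext
    GetMemoryChunkContext(void *pointer)
    {
        Assert(pointer != NULL);
        Assert(pointer == (void *) MAXALIGN(pointer));

        /* the context pointer lives in the sizeof(void *) bytes
         * immediately preceding the chunk */
        return *(MemoryContext *) (((char *) pointer) - sizeof(void *));
    }

And indeed, context = 0x7474617261763a20 looks a lot more like a piece of
ASCII text than like a pointer, which would be consistent with the header
in front of 0x298c740 having been overwritten (or the chunk having been
reused for something else).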

> #0  pfree (pointer=0x298c740) at mcxt.c:954
>         context = 0x7474617261763a20
> #1  0x00000000006a52e9 in perform_work_item (workitem=0x7f8ad1f94824) at 
> autovacuum.c:2676
>         cur_datname = 0x298c740 "no 1 :vartype 1184 :vartypmod -1 :varcollid 
> 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 146} {CONST :consttype 
> 1184 :consttypmod -1 :constcollid 0 :constlen 8 :constbyval true :constisnull 
> fal"...
>         cur_nspname = 0x298c728 "s ({VAR :varno 1 :varattno 1 :vartype 1184 
> :vartypmod -1 :varcollid 0 :varlevelsup 0 :varnoold 1 :varoattno 1 :location 
> 146} {CONST :consttype 1184 :consttypmod -1 :constcollid 0 :constlen 8 
> :constbyv"...
>         cur_relname = 0x298cd68 
> "cdrs_eric_msc_sms_2017_10_14_startofcharge_idx"
>         __func__ = "perform_work_item"
> #2  0x00000000006a6fd9 in do_autovacuum () at autovacuum.c:2533
...
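
Also, frame #1 is just the cleanup at the very end of perform_work_item(),
which looks roughly like this (heavily abridged):

    /* perform_work_item(), abridged */
    cur_relname = get_rel_name(workitem->avw_relation);
    cur_nspname = get_namespace_name(get_rel_namespace(workitem->avw_relation));
    cur_datname = get_database_name(MyDatabaseId);

    if (cur_relname && cur_nspname && cur_datname)
    {
        autovac_report_workitem(workitem, cur_nspname, cur_relname);

        /* ... the actual work, i.e. brin_summarize_range() for this
         * work item type ... */
    }

    /* be tidy */
    if (cur_datname)
        pfree(cur_datname);     /* <-- autovacuum.c:2676, the crash site */
    if (cur_nspname)
        pfree(cur_nspname);
    if (cur_relname)
        pfree(cur_relname);

So those are just palloc'd strings copied at the start of the work item,
and the fact that cur_datname / cur_nspname now point into what looks like
nodeToString() output would again be consistent with the memory getting
freed and reused underneath us.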

cheers

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

