On Sun, Jan 19, 2014 at 7:53 PM, Andrew Dunstan <and...@dunslane.net> wrote:
> Also crake does produce backtraces  on core dumps, and they are at the
> bottom of the buildfarm log. The latest failure backtrace is reproduced
> below.
>    ================== stack trace:
> /home/bf/bfr/root/HEAD/inst/data-C/core.12584 ==================
>    [New LWP 12584]
>    [Thread debugging using libthread_db enabled]
>    Using host libthread_db library "/lib64/libthread_db.so.1".
>    Core was generated by `postgres: buildfarm
> contrib_regression_test_shm_mq'.
>    Program terminated with signal 11, Segmentation fault.
>    #0  SetLatch (latch=0x1c) at pg_latch.c:509
>    509          if (latch->is_set)
>    #0  SetLatch (latch=0x1c) at pg_latch.c:509
>    #1  0x000000000064c04e in procsignal_sigusr1_handler
> (postgres_signal_arg=<optimized out>) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/procsignal.c:289
>    #2  <signal handler called>
>    #3  _dl_fini () at dl-fini.c:190
>    #4  0x000000361ba39931 in __run_exit_handlers (status=0,
> listp=0x361bdb1668, run_list_atexit=true) at exit.c:78
>    #5  0x000000361ba399b5 in __GI_exit (status=<optimized out>) at
> exit.c:100
>    #6  0x00000000006485a6 in proc_exit (code=0) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/ipc.c:143
>    #7  0x0000000000663abb in PostgresMain (argc=<optimized out>,
> argv=<optimized out>, dbname=0x12b8170 "contrib_regression_test_shm_mq",
> username=<optimized out>) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/tcop/postgres.c:4225
>    #8  0x000000000062220f in BackendRun (port=0x12d6bf0) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:4083
>    #9  BackendStartup (port=0x12d6bf0) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:3772
>    #10 ServerLoop () at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1583
>    #11 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1238
>    #12 0x000000000045e2e8 in main (argc=3, argv=0x12b7430) at
> /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/main/main.c:205

Hmm, that looks an awful lot like the SIGUSR1 signal handler is
getting called after we've already completed shmem_exit.  And indeed
that seems like the sort of thing that would result in dying horribly
in just this way.  The obvious fix seems to be to check
proc_exit_inprogress before doing anything that might touch shared
memory, but there are a lot of other SIGUSR1 handlers that don't do
that either.  However, in those cases, the likely cause of a SIGUSR1
would be a sinval catchup interrupt or a recovery conflict, which
aren't likely to be so far delayed that they arrive after we've
already disconnected from shared memory.  But the dynamic background
worker facility adds a new possible cause of SIGUSR1: the postmaster
letting us know that a child has started or died.  And that could
happen even after we've detached from shared memory.
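
To make the idea concrete, here is a rough standalone sketch of the
kind of guard I mean.  It is not the actual backend code: ProcSlot,
my_slot, exit_in_progress, attach_shared, begin_exit and set_latch are
all made-up stand-ins (exit_in_progress plays the role of
proc_exit_inprogress, and my_slot the role of our pointer into shared
memory).  The only point is that the handler tests the flag before
dereferencing anything that goes away at exit.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-ins for a latch and the per-process shared struct. */
typedef struct { volatile sig_atomic_t is_set; } Latch;
typedef struct { Latch latch; } ProcSlot;

static ProcSlot shared_slot;                 /* pretend this lives in shared memory */
static ProcSlot *volatile my_slot = NULL;    /* becomes NULL once we detach */

/* Assumed analogue of proc_exit_inprogress; set before we detach. */
static volatile sig_atomic_t exit_in_progress = 0;

/* Simplified latch set: just record that a wakeup was requested. */
static void set_latch(Latch *latch)
{
    latch->is_set = 1;
}

/* Attach to the pretend shared memory at process start. */
static void attach_shared(void)
{
    shared_slot.latch.is_set = 0;
    my_slot = &shared_slot;
    exit_in_progress = 0;
}

/* Start exiting: raise the flag first, then drop the shared pointer. */
static void begin_exit(void)
{
    assert(!exit_in_progress);   /* exit should only begin once */
    exit_in_progress = 1;
    my_slot = NULL;
}

/* Handler sketch: bail out early if shared state may already be gone. */
static void sigusr1_handler(int signo)
{
    (void) signo;
    if (exit_in_progress)
        return;                  /* too late to touch shared memory */
    set_latch(&my_slot->latch);
}
```

Without the exit_in_progress test, a signal arriving after begin_exit()
would hand set_latch a pointer computed from a NULL base, which is the
same shape of failure as the latch=0x1c in the trace above.  The order
inside begin_exit matters too: the flag has to be set before the
pointer is cleared, or there is still a window where the handler can
see a dead pointer.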

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

