Re: [COMMITTERS] pgsql: Make new crash restart test a bit more robust.

Andres Freund Tue, 19 Sep 2017 18:07:15 -0700

On 2017-09-19 16:46:58 -0400, Tom Lane wrote:
> Andres Freund <[email protected]> writes:
> > So this is genuinely interesting. When the machine is really loaded (as
> > in 6 animals running on a vm at the same time, including valgrind), psql
> > sometimes doesn't get the WARNING message from a shutdown. Instead it
> > gets
> > # psql:<stdin>:3: server closed the connection unexpectedly
> > #       This probably means the server terminated abnormally
> > #       before or while processing the request.
> > # psql:<stdin>:3: connection to server was lost
> 
> That seems pretty weird.  Maybe it's not the same case, but in
> 
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2017-09-19%2020%3A10%3A02
> 
> you can see from the postmaster log that the backend *is* issuing
> the message, or at least it's getting to the server log:
> 
> 2017-09-19 20:20:34.476 UTC [6363] [unknown] LOG:  connection received: 
> host=[local]
> 2017-09-19 20:20:34.477 UTC [6363] [unknown] LOG:  connection authorized: 
> user=andres database=postgres
> 2017-09-19 20:20:34.478 UTC [6363] t/013_crash_restart.pl LOG:  statement: 
> SELECT $$psql-connected$$;
> ...
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl WARNING:  
> terminating connection because of crash of another server process
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl DETAIL:  The 
> postmaster has commanded this server process to roll back the current 
> transaction and exit, because another server process exited abnormally and 
> possibly corrupted shared memory.
> 2017-09-19 20:20:34.485 UTC [6363] t/013_crash_restart.pl HINT:  In a moment 
> you should be able to reconnect to the database and repeat your command.
> 
> Have we forgotten an fflush() or something?


After hacking a fix for my previous theory, I started adding strace into
the mix, to verify this. Takes longer to reproduce, but after filtering
to -e trace=network, I got this:

socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 
ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = -1 
ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM, 0)         = 3
connect(3, {sa_family=AF_UNIX, sun_path="/tmp/EDkYotgk3u/.s.PGSQL.57230"}, 110) 
= 0
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
getsockname(3, {sa_family=AF_UNIX}, [128->2]) = 0
sendto(3, "\0\0\0O\0\3\0\0user\0andres\0database\0pos"..., 79, MSG_NOSIGNAL, 
NULL, 0) = 79
recvfrom(3, "R\0\0\0\10\0\0\0\0S\0\0\0,application_name\0t"..., 16384, 0, NULL, 
NULL) = 340
sendto(3, "Q\0\0\0\37SELECT $$psql-connected$$;\0", 32, MSG_NOSIGNAL, NULL, 0) 
= 32
recvfrom(3, 
"T\0\0\0!\0\1?column?\0\0\0\0\0\0\0\0\0\0\31\377\377\377\377\377\377"..., 
16384, 0, NULL, NULL) = 79
sendto(3, "Q\0\0\0\33SELECT pg_sleep(3600);\0", 28, MSG_NOSIGNAL, NULL, 0) = 28
recvfrom(3, 0x555817dae2a0, 16384, 0, NULL, NULL) = -1 ECONNRESET (Connection 
reset by peer)
+++ exited with 2 +++

So indeed, we got an ECONNRESET before receiving the proper error
message.
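
As an aid for reading the trace above: the two sendto() payloads starting
with "Q" are simple-query messages of the frontend/backend protocol (a type
byte 'Q', a big-endian int32 length that counts itself but not the type
byte, then the NUL-terminated query text). The sketch below is only an
illustration, not anything taken from psql or libpq; the helper names
build_query and parse_query are made up for this example. It rebuilds both
messages and checks that their sizes match the 32 and 28 bytes strace
reported.

```python
import struct


def build_query(sql: str) -> bytes:
    """Build a simple-query ('Q') message (hypothetical helper)."""
    body = sql.encode("utf-8") + b"\x00"      # query text is NUL-terminated
    length = 4 + len(body)                    # length field counts itself
    return b"Q" + struct.pack("!I", length) + body


def parse_query(msg: bytes) -> str:
    """Inverse of build_query, validating the framing (hypothetical helper)."""
    if msg[:1] != b"Q":
        raise ValueError("not a simple-query message")
    (length,) = struct.unpack("!I", msg[1:5])
    if length != len(msg) - 1:                # type byte is not counted
        raise ValueError("length field does not match message size")
    return msg[5:-1].decode("utf-8")


first = build_query("SELECT $$psql-connected$$;")
second = build_query("SELECT pg_sleep(3600);")

# strace showed sendto(...) = 32 and sendto(...) = 28 for these two.
print(len(first), len(second))
print(parse_query(second))
```

The length bytes strace printed in octal ("\37" and "\33") are 31 and 27,
i.e. the message sizes minus the leading type byte.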

The corresponding server log (debug3):
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 730
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 716
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 715
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 717
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 718
2017-09-20 00:57:00.573 UTC [713] DEBUG:  sending SIGQUIT to process 719
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl WARNING:  terminating 
connection because of crash of another server process
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DETAIL:  The 
postmaster has commanded this server process to roll back the current 
transaction and exit, because another server process exited abnormally and 
possibly corrupted shared memory.
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl HINT:  In a moment you 
should be able to reconnect to the database and repeat your command.
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  
shmem_exit(-1): 0 before_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  
shmem_exit(-1): 0 on_shmem_exit callbacks to make
2017-09-20 00:57:00.573 UTC [720] DEBUG:  shmem_exit(-1): 0 before_shmem_exit 
callbacks to make
2017-09-20 00:57:00.573 UTC [730] t/013_crash_restart.pl DEBUG:  proc_exit(-1): 
0 callbacks to make
...
2017-09-20 00:57:00.577 UTC [713] DEBUG:  server process (PID 730) exited with 
exit code 2
2017-09-20 00:57:00.577 UTC [713] DETAIL:  Failed process was running: SELECT 
pg_sleep(3600);
2017-09-20 00:57:00.577 UTC [713] LOG:  all server processes terminated; 
reinitializing

So the server indeed was killed by SIGQUIT, not an escalation to
SIGKILL. And it output stuff to the server log, and didn't complain
about communication to the client... Odd.

Greetings,

Andres Freund

