I have 5 servers in a testing environment that comprise a data warehousing cluster. They will typically each get exactly the same query at approximately the same time. Yesterday, around 1pm, three of the five got stuck on the same query, and each of them yields a similar stack trace. This happens now and then. The server version is 9.6.12 (which is obviously old, but I did not see any changes in the relevant code).
(gdb) bt
#0  0x00007fe856c0b463 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x00000000006b4416 in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7ffc9f2b0f60, cur_timeout=-1, set=0x27cace8) at latch.c:1053
#2  WaitEventSetWait (set=0x27cace8, timeout=timeout@entry=-1, occurred_events=occurred_events@entry=0x7ffc9f2b0f60, nevents=nevents@entry=1) at latch.c:1007
#3  0x00000000005f26dd in secure_write (port=0x27f16a0, ptr=ptr@entry=0x27f5528, len=len@entry=192) at be-secure.c:255
#4  0x00000000005fb51b in internal_flush () at pqcomm.c:1410
#5  0x00000000005fb72a in internal_putbytes (s=0x2a4f245 "14M04", s@entry=0x2a4f228 "", len=70) at pqcomm.c:1356
#6  0x00000000005fb7f0 in socket_putmessage (msgtype=68 'D', s=0x2a4f228 "", len=<optimized out>) at pqcomm.c:1553
#7  0x00000000005fd5d9 in pq_endmessage (buf=buf@entry=0x7ffc9f2b1040) at pqformat.c:347
#8  0x0000000000479a63 in printtup (slot=0x2958fc8, self=0x2b6bca0) at printtup.c:372
#9  0x00000000005c1cc9 in ExecutePlan (dest=0x2b6bca0, direction=<optimized out>, numberTuples=0, sendTuples=1 '\001', operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x2958cf8, estate=0x2958be8) at execMain.c:1606
#10 standard_ExecutorRun (queryDesc=0x2834998, direction=<optimized out>, count=0) at execMain.c:339
#11 0x00000000006d69a7 in PortalRunSelect (portal=portal@entry=0x2894e38, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, dest=dest@entry=0x2b6bca0) at pquery.c:948
#12 0x00000000006d7dbb in PortalRun (portal=0x2894e38, count=9223372036854775807, isTopLevel=<optimized out>, dest=0x2b6bca0, altdest=0x2b6bca0, completionTag=0x7ffc9f2b14e0 "") at pquery.c:789
#13 0x00000000006d5a06 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:1109
#14 0x000000000046fc28 in BackendRun (port=0x27f16a0) at postmaster.c:4342
#15 BackendStartup (port=0x27f16a0) at postmaster.c:4016
#16 ServerLoop () at postmaster.c:1721
#17 0x0000000000678119 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x27c8c90) at postmaster.c:1329
#18 0x000000000047088e in main (argc=3, argv=0x27c8c90) at main.c:228
(gdb) quit

Now, the fact that this happened to multiple servers at the same time strongly suggests a problem external to the database. The system that initiated the query, a cross-database query over dblink, gave up long ago (and was restarted as a precaution), and the connection is dead. secure_write(), however, waits on the latch with an infinite timeout, and there are clearly scenarios where epoll waits forever for an event that is never going to occur. If/when this happens, the only recourse is to restart the impacted database.

The question is: shouldn't the latch wait have a looping timeout that checks for interrupts? What would the risks be of jumping directly out of the latch loop?
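To make the question concrete, here is a minimal sketch of the kind of bounded wait I have in mind for secure_write()'s retry path. FeBeWaitSet, waitfor, ModifyWaitEvent(), WaitEventSetWait(), and ProcessClientWriteInterrupt() are what 9.6's be-secure.c already uses; the 1-second timeout and the ProcDiePending escape are illustrative assumptions on my part, not a tested patch:

/*
 * Hypothetical replacement for the wait in secure_write() (be-secure.c):
 * wake up periodically instead of sleeping forever, so the backend can
 * at least notice a pending die (e.g. from pg_terminate_backend()).
 */
for (;;)
{
    WaitEvent   event;
    int         rc;

    ModifyWaitEvent(FeBeWaitSet, 0, waitfor, NULL);

    /* 1000 ms is an arbitrary illustrative timeout, not a tuned value */
    rc = WaitEventSetWait(FeBeWaitSet, 1000, &event, 1);

    if (rc > 0 && (event.events & WL_LATCH_SET))
    {
        ResetLatch(MyLatch);
        ProcessClientWriteInterrupt(true);  /* what the current code does */
    }

    if (rc > 0 && (event.events & WL_SOCKET_WRITEABLE))
        break;                  /* socket ready again: retry the send */

    /*
     * Timed out.  We are mid-message here (printtup() is half emitted),
     * so a full CHECK_FOR_INTERRUPTS() could longjmp out and leave a
     * torn protocol message in the output buffer -- which I take to be
     * the main risk of "jumping directly out of the latch loop".
     * Honoring only a pending die, and exiting FATALly, seems safer,
     * since the connection is being abandoned anyway.
     */
    if (ProcDiePending)
        ereport(FATAL,
                (errcode(ERRCODE_ADMIN_SHUTDOWN),
                 errmsg("terminating connection due to administrator command")));
}

merlin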