Hello Noah, 31.10.2024 04:39, Noah Misch wrote:
I had pushed this during the indicated week, before your mail. Reverting it is an option. Let's see if more opinions arrive.
I've accidentally discovered an incorrect behaviour caused by commit 4eac5a1fa. Running this script: for ((j=1; j<=100; j++)); do echo "iteration $j" cat << 'EOF' | timeout 60 psql >>psql-$SID.log || { res=1; echo "hanged on iteration $j"; break; } SELECT format('CREATE TABLE t%s (a int, b text);', g) FROM generate_series(1, 50) g \gexec SELECT format('DROP TABLE t%s;', g) FROM generate_series(1, 50) g \gexec EOF done with autovacuum = on autovacuum_naptime = 1s autovacuum_vacuum_threshold = 1 autovacuum_analyze_threshold = 1 in parallel using separate servers (the full script is attached), like: parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40` I can catch the following: ... 3 hanged on iteration 51 ... 19 hanged on iteration 64 ... 39 hanged on iteration 99 And after the script run, I see the server processes hanging: law 1081433 1 0 16:22 ? 00:00:00 .../usr/local/pgsql/bin/postgres law 1081452 1081433 0 16:22 ? 00:00:00 postgres: checkpointer law 1081453 1081433 0 16:22 ? 00:00:00 postgres: background writer law 1081460 1081433 0 16:22 ? 00:00:00 postgres: walwriter law 1081462 1081433 0 16:22 ? 00:00:00 postgres: autovacuum launcher law 1081464 1081433 0 16:22 ? 00:00:00 postgres: logical replication launcher law 1143065 1081433 0 16:32 ? 00:00:00 postgres: postgres postgres [local] CREATE TABLE law 1143263 1081433 0 16:32 ? 00:00:00 postgres: autovacuum worker postgres law 1143320 1081433 0 16:32 ? 00:00:00 postgres: autovacuum worker postgres law 1143403 1081433 0 16:32 ? 00:00:00 postgres: autovacuum worker Attaching to process 1143065 ... (gdb) bt#0 __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:57 #1 __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:87 #2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fed9a8171b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139 #3 0x00007feda4674c5f in do_futex_wait (sem=sem@entry=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4 0x00007feda4674cf8 in __new_sem_wait_slow64 (sem=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183 #5 0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a8171b8) at pg_sema.c:327 #6 0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318 #7 0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182#8 0x0000561dd6d4f506 in heapam_index_fetch_tuple (scan=0x561dd8cb6588, tid=0x561dd8cb64d0, snapshot=0x561dd8bfee28, slot=0x561dd8cb75a0, call_again=0x561dd8cb64d6, all_dead=0x7ffdd63842c6) at heapam_handler.c:146
... (the full backtrace is attached) All three autovacuum workers (1143263, 1143320, 1143403) are also waiting for the same buffer lock: #5 0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327 #6 0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318 #7 0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182 Probably, this can be reproduced with VACUUM pg_class/pg_type/..., but I haven't found out the exact combination needed yet. Also as a side note, these processes can't be terminated with SIGTERM, I have to kill them. Initially I saw this on a slowed down VM, but with the attached patch applied I could reproduce it on my workstation too. Best regards, Alexander
reproi.sh
Description: application/shellscript
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c index 2168259247..730ef9b5a2 100644 --- a/src/backend/access/heap/heapam_handler.c +++ b/src/backend/access/heap/heapam_handler.c @@ -138,6 +138,10 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan, heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf); } +uint64 r = 0; +for (uint64 i = 0; i < 100000; i++) r += (r ^ i); +Assert(r != 0); + /* Obtain share-lock on the buffer so we can examine visibility */ LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE); got_heap_tuple = heap_hot_search_buffer(tid,
(gdb) bt #0 __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:57 #1 __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:87 #2 __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fed9a8171b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139 #3 0x00007feda4674c5f in do_futex_wait (sem=sem@entry=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111 #4 0x00007feda4674cf8 in __new_sem_wait_slow64 (sem=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183 #5 0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a8171b8) at pg_sema.c:327 #6 0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) at lwlock.c:1318 #7 0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182 #8 0x0000561dd6d4f506 in heapam_index_fetch_tuple (scan=0x561dd8cb6588, tid=0x561dd8cb64d0, snapshot=0x561dd8bfee28, slot=0x561dd8cb75a0, call_again=0x561dd8cb64d6, all_dead=0x7ffdd63842c6) at heapam_handler.c:146 #9 0x0000561dd6d6a7da in table_index_fetch_tuple (scan=0x561dd8cb6588, tid=0x561dd8cb64d0, snapshot=0x561dd8bfee28, slot=0x561dd8cb75a0, call_again=0x561dd8cb64d6, all_dead=0x7ffdd63842c6) at ../../../../src/include/access/tableam.h:1226 #10 0x0000561dd6d6bd12 in index_fetch_heap (scan=0x561dd8cb6470, slot=0x561dd8cb75a0) at indexam.c:622 #11 0x0000561dd6d6beb8 in index_getnext_slot (scan=0x561dd8cb6470, direction=ForwardScanDirection, slot=0x561dd8cb75a0) at indexam.c:682 #12 0x0000561dd6d69b49 in systable_getnext (sysscan=0x561dd8cb6358) at genam.c:512 #13 0x0000561dd7400229 in ScanPgRelation (targetRelId=1259, indexOK=true, force_non_historic=false) at relcache.c:385 #14 0x0000561dd7403c1f in RelationReloadNailed (relation=0x7feda4c05b80) at relcache.c:2381 #15 0x0000561dd7404122 in RelationClearRelation (relation=0x7feda4c05b80, rebuild=true) at relcache.c:2527 #16 0x0000561dd7403556 in RelationIdGetRelation (relationId=1259) at relcache.c:2088 #17 0x0000561dd6cd819e in relation_open (relationId=1259, lockmode=1) at relation.c:59 #18 0x0000561dd6dbaa60 in table_open (relationId=1259, lockmode=1) at table.c:43 #19 0x0000561dd74001bc in ScanPgRelation (targetRelId=56237, indexOK=true, force_non_historic=false) at relcache.c:368 #20 0x0000561dd7403884 in RelationReloadIndexInfo (relation=0x7feda4a3e978) at relcache.c:2257 #21 0x0000561dd74041a7 in RelationClearRelation (relation=0x7feda4a3e978, rebuild=true) at relcache.c:2551 #22 0x0000561dd7404a76 in RelationFlushRelation (relation=0x7feda4a3e978) at relcache.c:2829 #23 0x0000561dd7404c38 in RelationCacheInvalidateEntry (relationId=56237) at relcache.c:2900 #24 0x0000561dd73f4971 in LocalExecuteInvalidationMessage (msg=0x561dd8bfe598) at inval.c:666 #25 0x0000561dd73f46c6 in ProcessInvalidationMessages (group=0x561dd8bfe2b8, func=0x561dd73f4861 <LocalExecuteInvalidationMessage>) at inval.c:537 #26 0x0000561dd73f581e in CommandEndInvalidationMessages () at inval.c:1291 #27 0x0000561dd6dd6d4e in AtCCI_LocalCache () at xact.c:1560 #28 0x0000561dd6dd660b in CommandCounterIncrement () at xact.c:1111 #29 0x0000561dd6e333a2 in index_build (heapRelation=0x7feda4a3a2b0, indexRelation=0x7feda4a3e978, indexInfo=0x561dd8cb51b0, isreindex=false, parallel=true) at index.c:3108 #30 0x0000561dd6e2fb0f in index_create (heapRelation=0x7feda4a3a2b0, indexRelationName=0x7ffdd6384f00 "pg_toast_56233_index", indexRelationId=56237, parentIndexRelid=0, parentConstraintId=0, relFileNode=0, indexInfo=0x561dd8cb51b0, indexColNames=0x561dd8ca1010, accessMethodObjectId=403, tableSpaceId=0, collationObjectId=0x7ffdd6384e60, classObjectId=0x7ffdd6384e68, coloptions=0x7ffdd6384e5c, reloptions=0, flags=1, constr_flags=0, allow_system_table_mods=true, is_internal=true, constraintId=0x0) at index.c:1254 #31 0x0000561dd6e676b7 in create_toast_table (rel=0x7feda4c1b140, toastOid=0, toastIndexOid=0, reloptions=0, lockmode=8, check=false, OIDOldToast=0) at toasting.c:326 #32 0x0000561dd6e66eec in CheckAndCreateToastTable (relOid=56233, reloptions=0, lockmode=8, check=false, OIDOldToast=0) at toasting.c:90 #33 0x0000561dd6e66e85 in NewRelationCreateToastTable (relOid=56233, reloptions=0) at toasting.c:77 #34 0x0000561dd7251db9 in ProcessUtilitySlow (pstate=0x561dd8c91610, pstmt=0x561dd8bd5c40, queryString=0x561dd8bd4a40 "CREATE TABLE t70 (a int, b text);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x561dd8bd5d30, qc=0x7ffdd63857b0) at utility.c:1199 #35 0x0000561dd7251a54 in standard_ProcessUtility (pstmt=0x561dd8bd5c40, queryString=0x561dd8bd4a40 "CREATE TABLE t70 (a int, b text);", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x561dd8bd5d30, qc=0x7ffdd63857b0) at utility.c:1074 #36 0x0000561dd7250992 in ProcessUtility (pstmt=0x561dd8bd5c40, queryString=0x561dd8bd4a40 "CREATE TABLE t70 (a int, b text);", readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x561dd8bd5d30, qc=0x7ffdd63857b0) at utility.c:530 #37 0x0000561dd724f23e in PortalRunUtility (portal=0x561dd8c45c60, pstmt=0x561dd8bd5c40, isTopLevel=true, setHoldSnapshot=false, dest=0x561dd8bd5d30, qc=0x7ffdd63857b0) at pquery.c:1158 #38 0x0000561dd724f4c9 in PortalRunMulti (portal=0x561dd8c45c60, isTopLevel=true, setHoldSnapshot=false, dest=0x561dd8bd5d30, altdest=0x561dd8bd5d30, qc=0x7ffdd63857b0) at pquery.c:1315 #39 0x0000561dd724e8f5 in PortalRun (portal=0x561dd8c45c60, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x561dd8bd5d30, altdest=0x561dd8bd5d30, qc=0x7ffdd63857b0) at pquery.c:791 #40 0x0000561dd7247314 in exec_simple_query (query_string=0x561dd8bd4a40 "CREATE TABLE t70 (a int, b text);") at postgres.c:1254 #41 0x0000561dd724c5ef in PostgresMain (dbname=0x561dd8c03ae0 "postgres", username=0x561dd8c03ab8 "postgres") at postgres.c:4686 #42 0x0000561dd7170ac7 in BackendRun (port=0x561dd8bfc5b0) at postmaster.c:4514 #43 0x0000561dd717034e in BackendStartup (port=0x561dd8bfc5b0) at postmaster.c:4242 #44 0x0000561dd716c2db in ServerLoop () at postmaster.c:1809 #45 0x0000561dd716ba38 in PostmasterMain (argc=1, argv=0x561dd8bcec10) at postmaster.c:1481 #46 0x0000561dd705e929 in main (argc=1, argv=0x561dd8bcec10) at main.c:202