Hello Noah,

31.10.2024 04:39, Noah Misch wrote:
I had pushed this during the indicated week, before your mail.  Reverting it
is an option.  Let's see if more opinions arrive.

I've accidentally discovered an incorrect behaviour caused by commit
4eac5a1fa. Running this script:
for ((j=1; j<=100; j++)); do
echo "iteration $j"

cat << 'EOF' | timeout 60 psql >>psql-$SID.log || { res=1; echo "hanged on iteration 
$j"; break; }
SELECT format('CREATE TABLE t%s (a int, b text);', g) FROM generate_series(1, 
50) g
\gexec

SELECT format('DROP TABLE t%s;', g) FROM generate_series(1, 50) g
\gexec
EOF
done

with
autovacuum = on
autovacuum_naptime = 1s
autovacuum_vacuum_threshold = 1
autovacuum_analyze_threshold = 1

in parallel using separate servers (the full script is attached), like:
parallel -j40 --linebuffer --tag .../reproi.sh ::: `seq 40`

I can catch the following:
...
3       hanged on iteration 51
...
19      hanged on iteration 64
...
39      hanged on iteration 99

And after the script run, I see the server processes hanging:
law      1081433       1  0 16:22 ?        00:00:00 
.../usr/local/pgsql/bin/postgres
law      1081452 1081433  0 16:22 ?        00:00:00 postgres: checkpointer
law      1081453 1081433  0 16:22 ?        00:00:00 postgres: background writer
law      1081460 1081433  0 16:22 ?        00:00:00 postgres: walwriter
law      1081462 1081433  0 16:22 ?        00:00:00 postgres: autovacuum 
launcher
law      1081464 1081433  0 16:22 ?        00:00:00 postgres: logical 
replication launcher
law      1143065 1081433  0 16:32 ?        00:00:00 postgres: postgres postgres 
[local] CREATE TABLE
law      1143263 1081433  0 16:32 ?        00:00:00 postgres: autovacuum worker 
postgres
law      1143320 1081433  0 16:32 ?        00:00:00 postgres: autovacuum worker 
postgres
law      1143403 1081433  0 16:32 ?        00:00:00 postgres: autovacuum worker


Attaching to process 1143065
...
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=265, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:57 #1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x7fed9a8171b8) at ./nptl/futex-internal.c:87 #2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7fed9a8171b8, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at ./nptl/futex-internal.c:139 #3  0x00007feda4674c5f in do_futex_wait (sem=sem@entry=0x7fed9a8171b8, abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x00007feda4674cf8 in __new_sem_wait_slow64 (sem=0x7fed9a8171b8, 
abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a8171b8) at pg_sema.c:327
#6  0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) 
at lwlock.c:1318
#7  0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182
#8  0x0000561dd6d4f506 in heapam_index_fetch_tuple (scan=0x561dd8cb6588, tid=0x561dd8cb64d0, snapshot=0x561dd8bfee28, slot=0x561dd8cb75a0, call_again=0x561dd8cb64d6, all_dead=0x7ffdd63842c6) at heapam_handler.c:146
...
(the full backtrace is attached)

All three autovacuum workers (1143263, 1143320, 1143403) are also waiting
for the same buffer lock:
#5  0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a817338) at pg_sema.c:327
#6  0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) 
at lwlock.c:1318
#7  0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182

Probably, this can be reproduced with VACUUM pg_class/pg_type/..., but I
haven't found out the exact combination needed yet.

Also as a side note, these processes can't be terminated with SIGTERM, I
have to kill them.

Initially I saw this on a slowed down VM, but with the attached patch
applied I could reproduce it on my workstation too.

Best regards,
Alexander

Attachment: reproi.sh
Description: application/shellscript

diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2168259247..730ef9b5a2 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -138,6 +138,10 @@ heapam_index_fetch_tuple(struct IndexFetchTableData *scan,
 			heap_page_prune_opt(hscan->xs_base.rel, hscan->xs_cbuf);
 	}
 
+uint64 r = 0;
+for (uint64 i = 0; i < 100000; i++) r += (r ^ i);
+Assert(r != 0);
+
 	/* Obtain share-lock on the buffer so we can examine visibility */
 	LockBuffer(hscan->xs_cbuf, BUFFER_LOCK_SHARE);
 	got_heap_tuple = heap_hot_search_buffer(tid,
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, 
abstime=0x0, op=265, expected=0, futex_word=0x7fed9a8171b8) at 
./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, 
abstime=0x0, clockid=0, expected=0, futex_word=0x7fed9a8171b8) at 
./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 
(futex_word=futex_word@entry=0x7fed9a8171b8, expected=expected@entry=0, 
clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>) at 
./nptl/futex-internal.c:139
#3  0x00007feda4674c5f in do_futex_wait (sem=sem@entry=0x7fed9a8171b8, 
abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:111
#4  0x00007feda4674cf8 in __new_sem_wait_slow64 (sem=0x7fed9a8171b8, 
abstime=0x0, clockid=0) at ./nptl/sem_waitcommon.c:183
#5  0x0000561dd715f1fe in PGSemaphoreLock (sema=0x7fed9a8171b8) at pg_sema.c:327
#6  0x0000561dd722fe02 in LWLockAcquire (lock=0x7fed9ad9b4e4, mode=LW_SHARED) 
at lwlock.c:1318
#7  0x0000561dd71f8423 in LockBuffer (buffer=36, mode=1) at bufmgr.c:4182
#8  0x0000561dd6d4f506 in heapam_index_fetch_tuple (scan=0x561dd8cb6588, 
tid=0x561dd8cb64d0, snapshot=0x561dd8bfee28, slot=0x561dd8cb75a0, 
call_again=0x561dd8cb64d6, all_dead=0x7ffdd63842c6) at heapam_handler.c:146
#9  0x0000561dd6d6a7da in table_index_fetch_tuple (scan=0x561dd8cb6588, 
tid=0x561dd8cb64d0, snapshot=0x561dd8bfee28, slot=0x561dd8cb75a0, 
call_again=0x561dd8cb64d6, all_dead=0x7ffdd63842c6)
    at ../../../../src/include/access/tableam.h:1226
#10 0x0000561dd6d6bd12 in index_fetch_heap (scan=0x561dd8cb6470, 
slot=0x561dd8cb75a0) at indexam.c:622
#11 0x0000561dd6d6beb8 in index_getnext_slot (scan=0x561dd8cb6470, 
direction=ForwardScanDirection, slot=0x561dd8cb75a0) at indexam.c:682
#12 0x0000561dd6d69b49 in systable_getnext (sysscan=0x561dd8cb6358) at 
genam.c:512
#13 0x0000561dd7400229 in ScanPgRelation (targetRelId=1259, indexOK=true, 
force_non_historic=false) at relcache.c:385
#14 0x0000561dd7403c1f in RelationReloadNailed (relation=0x7feda4c05b80) at 
relcache.c:2381
#15 0x0000561dd7404122 in RelationClearRelation (relation=0x7feda4c05b80, 
rebuild=true) at relcache.c:2527
#16 0x0000561dd7403556 in RelationIdGetRelation (relationId=1259) at 
relcache.c:2088
#17 0x0000561dd6cd819e in relation_open (relationId=1259, lockmode=1) at 
relation.c:59
#18 0x0000561dd6dbaa60 in table_open (relationId=1259, lockmode=1) at table.c:43
#19 0x0000561dd74001bc in ScanPgRelation (targetRelId=56237, indexOK=true, 
force_non_historic=false) at relcache.c:368
#20 0x0000561dd7403884 in RelationReloadIndexInfo (relation=0x7feda4a3e978) at 
relcache.c:2257
#21 0x0000561dd74041a7 in RelationClearRelation (relation=0x7feda4a3e978, 
rebuild=true) at relcache.c:2551
#22 0x0000561dd7404a76 in RelationFlushRelation (relation=0x7feda4a3e978) at 
relcache.c:2829
#23 0x0000561dd7404c38 in RelationCacheInvalidateEntry (relationId=56237) at 
relcache.c:2900
#24 0x0000561dd73f4971 in LocalExecuteInvalidationMessage (msg=0x561dd8bfe598) 
at inval.c:666
#25 0x0000561dd73f46c6 in ProcessInvalidationMessages (group=0x561dd8bfe2b8, 
func=0x561dd73f4861 <LocalExecuteInvalidationMessage>) at inval.c:537
#26 0x0000561dd73f581e in CommandEndInvalidationMessages () at inval.c:1291
#27 0x0000561dd6dd6d4e in AtCCI_LocalCache () at xact.c:1560
#28 0x0000561dd6dd660b in CommandCounterIncrement () at xact.c:1111
#29 0x0000561dd6e333a2 in index_build (heapRelation=0x7feda4a3a2b0, 
indexRelation=0x7feda4a3e978, indexInfo=0x561dd8cb51b0, isreindex=false, 
parallel=true) at index.c:3108
#30 0x0000561dd6e2fb0f in index_create (heapRelation=0x7feda4a3a2b0, 
indexRelationName=0x7ffdd6384f00 "pg_toast_56233_index", indexRelationId=56237, 
parentIndexRelid=0, parentConstraintId=0, relFileNode=0, 
indexInfo=0x561dd8cb51b0, 
    indexColNames=0x561dd8ca1010, accessMethodObjectId=403, tableSpaceId=0, 
collationObjectId=0x7ffdd6384e60, classObjectId=0x7ffdd6384e68, 
coloptions=0x7ffdd6384e5c, reloptions=0, flags=1, constr_flags=0, 
    allow_system_table_mods=true, is_internal=true, constraintId=0x0) at 
index.c:1254
#31 0x0000561dd6e676b7 in create_toast_table (rel=0x7feda4c1b140, toastOid=0, 
toastIndexOid=0, reloptions=0, lockmode=8, check=false, OIDOldToast=0) at 
toasting.c:326
#32 0x0000561dd6e66eec in CheckAndCreateToastTable (relOid=56233, reloptions=0, 
lockmode=8, check=false, OIDOldToast=0) at toasting.c:90
#33 0x0000561dd6e66e85 in NewRelationCreateToastTable (relOid=56233, 
reloptions=0) at toasting.c:77
#34 0x0000561dd7251db9 in ProcessUtilitySlow (pstate=0x561dd8c91610, 
pstmt=0x561dd8bd5c40, queryString=0x561dd8bd4a40 "CREATE TABLE t70 (a int, b 
text);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, 
    dest=0x561dd8bd5d30, qc=0x7ffdd63857b0) at utility.c:1199
#35 0x0000561dd7251a54 in standard_ProcessUtility (pstmt=0x561dd8bd5c40, 
queryString=0x561dd8bd4a40 "CREATE TABLE t70 (a int, b text);", 
readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, 
    dest=0x561dd8bd5d30, qc=0x7ffdd63857b0) at utility.c:1074
#36 0x0000561dd7250992 in ProcessUtility (pstmt=0x561dd8bd5c40, 
queryString=0x561dd8bd4a40 "CREATE TABLE t70 (a int, b text);", 
readOnlyTree=false, context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, 
dest=0x561dd8bd5d30, 
    qc=0x7ffdd63857b0) at utility.c:530
#37 0x0000561dd724f23e in PortalRunUtility (portal=0x561dd8c45c60, 
pstmt=0x561dd8bd5c40, isTopLevel=true, setHoldSnapshot=false, 
dest=0x561dd8bd5d30, qc=0x7ffdd63857b0) at pquery.c:1158
#38 0x0000561dd724f4c9 in PortalRunMulti (portal=0x561dd8c45c60, 
isTopLevel=true, setHoldSnapshot=false, dest=0x561dd8bd5d30, 
altdest=0x561dd8bd5d30, qc=0x7ffdd63857b0) at pquery.c:1315
#39 0x0000561dd724e8f5 in PortalRun (portal=0x561dd8c45c60, 
count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x561dd8bd5d30, 
altdest=0x561dd8bd5d30, qc=0x7ffdd63857b0) at pquery.c:791
#40 0x0000561dd7247314 in exec_simple_query (query_string=0x561dd8bd4a40 
"CREATE TABLE t70 (a int, b text);") at postgres.c:1254
#41 0x0000561dd724c5ef in PostgresMain (dbname=0x561dd8c03ae0 "postgres", 
username=0x561dd8c03ab8 "postgres") at postgres.c:4686
#42 0x0000561dd7170ac7 in BackendRun (port=0x561dd8bfc5b0) at postmaster.c:4514
#43 0x0000561dd717034e in BackendStartup (port=0x561dd8bfc5b0) at 
postmaster.c:4242
#44 0x0000561dd716c2db in ServerLoop () at postmaster.c:1809
#45 0x0000561dd716ba38 in PostmasterMain (argc=1, argv=0x561dd8bcec10) at 
postmaster.c:1481
#46 0x0000561dd705e929 in main (argc=1, argv=0x561dd8bcec10) at main.c:202

Reply via email to