Hi hackers,

I heard a report of a 10.1 cluster hanging with several 'BtreePage' wait_events showing in pg_stat_activity. The query plan involved Parallel Index Only Scan, and the table is concurrently updated quite heavily. I tried and failed to make a reproducer, but from the clues available it seemed clear that somehow *all* participants in a Parallel Index Scan must be waiting for someone else to advance the scan. The report came with a backtrace [1] that was the same in all 3 backends (leader + 2 workers), which I'll summarise here:
  ConditionVariableSleep
  _bt_parallel_seize
  _bt_readnextpage
  _bt_steppage
  _bt_next
  btgettuple
  index_getnext_tid
  IndexOnlyNext

I think _bt_steppage() called _bt_parallel_seize(), then it called _bt_readnextpage() which I guess must have encountered a BTP_DELETED or BTP_HALF_DEAD-marked page so didn't take this early break out of the loop:

    /* check for deleted page */
    if (!P_IGNORE(opaque))
    {
        PredicateLockPage(rel, blkno, scan->xs_snapshot);
        /* see if there are any matches on this page */
        /* note that this will clear moreRight if we can stop */
        if (_bt_readpage(scan, dir, P_FIRSTDATAKEY(opaque)))
            break;
    }

... and then it called _bt_parallel_seize() itself, in violation of the rule (by my reading of the code) that you must call _bt_parallel_release() (via _bt_readpage()) or _bt_parallel_done() after seizing the scan. If you call _bt_parallel_seize() again without doing that first, you'll finish up waiting for yourself forever.

Does this theory make sense?

[1] http://dpaste.com/05PGJ4E

-- 
Thomas Munro
http://www.enterprisedb.com
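
P.S. In case it helps to see the theory in miniature, here is a toy, single-process model of the seize/release rule described above. It is only a sketch and not the nbtree code: every name in it (ToyParallelScan, toy_seize, toy_release, toy_skip_ignored_page) is made up for illustration. Instead of sleeping on a condition variable, the toy seize reports the point where the real function would sleep. Releasing before the second seize is shown purely as the contrasting case, as an assumption about what following the rule would look like, not as a proposed patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the shared parallel scan state (hypothetical names). */
typedef enum { TOY_IDLE, TOY_ADVANCING, TOY_DONE } ToyScanState;

typedef struct
{
    ToyScanState state;     /* who, if anyone, is advancing the scan */
    int          next_page; /* page the next seizer should read */
} ToyParallelScan;

typedef enum { TOY_SEIZE_OK, TOY_SEIZE_DONE, TOY_SEIZE_WOULD_SLEEP } ToySeizeResult;

/*
 * Non-blocking model of "seize": where a real implementation would sleep
 * until another participant releases the scan, report WOULD_SLEEP instead.
 */
static ToySeizeResult
toy_seize(ToyParallelScan *scan, int *page)
{
    if (scan->state == TOY_DONE)
        return TOY_SEIZE_DONE;
    if (scan->state == TOY_ADVANCING)
        return TOY_SEIZE_WOULD_SLEEP;   /* someone (maybe us!) holds it */
    scan->state = TOY_ADVANCING;
    *page = scan->next_page;
    return TOY_SEIZE_OK;
}

/* Model of "release": publish the next page and let others advance. */
static void
toy_release(ToyParallelScan *scan, int next_page)
{
    assert(scan->state == TOY_ADVANCING);
    scan->next_page = next_page;
    scan->state = TOY_IDLE;
}

/*
 * One participant seizes the scan, finds an ignorable page (so no release
 * happens as a side effect of reading it), and then seizes again.  With
 * release_first == false this mimics the suspected bug.
 */
static ToySeizeResult
toy_skip_ignored_page(bool release_first)
{
    ToyParallelScan scan = { TOY_IDLE, 1 };
    int             page = 0;
    ToySeizeResult  first = toy_seize(&scan, &page);

    assert(first == TOY_SEIZE_OK);
    (void) first;                       /* quiet NDEBUG builds */
    if (release_first)
        toy_release(&scan, page + 1);
    return toy_seize(&scan, &page);
}
```

Calling toy_skip_ignored_page(false) yields TOY_SEIZE_WOULD_SLEEP: the only participant holding the scan is the one now waiting for it, so nobody is left to wake it up. With toy_skip_ignored_page(true) the second seize succeeds.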