On Tue, 8 Jun 2021 at 13:03, Justin Pryzby <[email protected]> wrote:
>
> On Sun, Jun 06, 2021 at 11:00:38AM -0700, Peter Geoghegan wrote:
> > On Sun, Jun 6, 2021 at 9:35 AM Justin Pryzby <[email protected]> wrote:
> > > I'll leave the instance running for a little bit before restarting (or
> > > kill-9)
> > > in case someone requests more info.
> >
> > How about dumping the page image out, and sharing it with the list?
> > This procedure should work fine from gdb:
> >
> > https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#Dumping_a_page_image_from_within_GDB
>
> > I suggest that you dump the "page" pointer inside lazy_scan_prune(). I
> > imagine that you have the instance already stuck in an infinite loop,
> > so what we'll probably see from the page image is the page after the
> > first prune and another no-progress prune.
>
> The cluster was again rejecting with "too many clients already".
>
> I was able to open a shell this time, but it immediately froze when I tried to
> tab complete "pg_stat_acti"...
>
> I was able to dump the page image, though - attached. I can send you its
> "data" privately, if desirable. I'll also try to step through this.

Could you attach a dump of lazy_scan_prune's vacrel, all the global
visibility states (GlobalVisCatalogRels, and possibly GlobalVisSharedRels,
GlobalVisDataRels, and GlobalVisTempRels), and heap_page_prune's PruneState?

Additionally, the locals of lazy_scan_prune (more specifically, the 'offnum'
when it enters heap_page_prune) would also be appreciated, as that helps
identify the tuple.

I've been looking at what might have caused this, and I'm currently stuck
for lack of information about GlobalVisCatalogRels and the PruneState.

One curiosity that I did notice is that the t_xmax of the problematic tuples
is exactly one lower than the OldestXmin. Not weird, but a curiosity.

With regards,

Matthias van de Meent.

PS. Attached are a few of my current research notes, which are mainly
comparisons between heap_prune_satisfies_vacuum and HeapTupleSatisfiesVacuum.

# Analysis of what can happen

In heap_prune_chain, heap_prune_satisfies_vacuum (HPSV) is used for visibility checks instead of HeapTupleSatisfiesVacuum (HTSV). Both functions use HeapTupleSatisfiesVacuumHorizon (HTSVH), but differ in one behaviour: handling HEAPTUPLE_RECENTLY_DEAD.

More specifically, when HTSVH returns RECENTLY_DEAD, HTSV will return DEAD when the dead_after result from HTSVH precedes vacrel->OldestXmin (using XID).

HPSV, however, will return DEAD in either of these cases:

- when dead_after precedes prstate->old_snap_xmin (using XID; but only when OldSnapshotThresholdActive(), so not here I presume)
- when dead_after is removable according to GlobalVisTestIsRemovableXid (using the GlobalVisState applicable for that relation, in this case GlobalVisCatalogRels; using FXID)

GlobalVisTestIsRemovableXid returns true when the FXID of the tuple (as generated relative to globalVisState->definitely_needed) is less than globalVisState->maybe_needed.

One more item is that globalVisState->maybe_needed is set from the same value that is later returned and assigned to vacrel->OldestXmin.
