Hi, I'm still fighting with snapshot_too_old. The feature is just badly undertested, underdocumented, and there's lots of other oddities. I've now spent about as much time on that feature than on the whole rest of the patchset.
As an example for under-documented, here's a definitely non-trivial block of code without a single comment explaining what it's doing. if (oldSnapshotControl->count_used > 0 && ts >= oldSnapshotControl->head_timestamp) { int offset; offset = ((ts - oldSnapshotControl->head_timestamp) / USECS_PER_MINUTE); if (offset > oldSnapshotControl->count_used - 1) offset = oldSnapshotControl->count_used - 1; offset = (oldSnapshotControl->head_offset + offset) % OLD_SNAPSHOT_TIME_MAP_ENTRIES; xlimit = oldSnapshotControl->xid_by_minute[offset]; if (NormalTransactionIdFollows(xlimit, recentXmin)) SetOldSnapshotThresholdTimestamp(ts, xlimit); } LWLockRelease(OldSnapshotTimeMapLock); Also, SetOldSnapshotThresholdTimestamp() acquires a separate spinlock - not great to call that with the lwlock held. Then there's this comment: /* * Failsafe protection against vacuuming work of active transaction. * * This is not an assertion because we avoid the spinlock for * performance, leaving open the possibility that xlimit could advance * and be more current; but it seems prudent to apply this limit. It * might make pruning a tiny bit less aggressive than it could be, but * protects against data loss bugs. */ if (TransactionIdIsNormal(latest_xmin) && TransactionIdPrecedes(latest_xmin, xlimit)) xlimit = latest_xmin; if (NormalTransactionIdFollows(xlimit, recentXmin)) return xlimit; So this is not using lock, so the values aren't accurate, but it avoids data loss bugs? I also don't know which spinlock is avoided on the path here as mentioend - the acquisition is unconditional. But more importantly - if this is about avoiding data loss bugs, how on earth is it ok that we don't go through these checks in the old_snapshot_threshold == 0 path? /* * Zero threshold always overrides to latest xmin, if valid. Without * some heuristic it will find its own snapshot too old on, for * example, a simple UPDATE -- which would make it useless for most * testing, but there is no principled way to ensure that it doesn't * fail in this way. Use a five-second delay to try to get useful * testing behavior, but this may need adjustment. */ if (old_snapshot_threshold == 0) { if (TransactionIdPrecedes(latest_xmin, MyProc->xmin) && TransactionIdFollows(latest_xmin, xlimit)) xlimit = latest_xmin; ts -= 5 * USECS_PER_SEC; SetOldSnapshotThresholdTimestamp(ts, xlimit); return xlimit; } This feature looks like it was put together by applying force until something gave, and then stopping just there. Greetings, Andres Freund