Re: [HACKERS] Proposal for CSN based snapshots

Alvaro Herrera Thu, 10 Jul 2014 14:35:12 -0700

Heikki Linnakangas wrote:

> Attached is a new patch. It now keeps the current pg_clog unchanged,
> but adds a new pg_csnlog besides it. pg_csnlog is more similar to
> pg_subtrans than pg_clog: it's not WAL-logged, is reset at startup,
> and segments older than GlobalXmin can be truncated.
> 
> This addresses the disk space consumption, and simplifies pg_upgrade.
> 
> There are no other significant changes in this new version, so it's
> still very much WIP. But please take a look!


Thanks.  I've been looking at this.  It needs a rebase; I had to apply
to commit 89cf2d52030 (Wed Jul 2 13:11:05 2014 -0400) because of
conflicts with the commit after that; it applies with some fuzz.  I read
the discussion in this thread and the README, to try to understand how
this is supposed to work.  It looks pretty good.

One thing not quite clear to me is the now-static RecentXmin,
GetRecentXmin(), AdvanceGlobalRecentXmin, and the like.  I found the
whole thing pretty confusing.  I noticed that ShmemVariableCache->recentXmin
is read without a lock in GetSnapshotData(), but exclusive ProcArrayLock
is taken to write to it in GetRecentXmin().  I think we can write it
without a lock also, since we assume that storing an xid is atomic.
With that change, I think you can make the acquisition of ProcArray Lock
to walk the pgproc list use shared mode, not exclusive.

Another point is that RecentXmin is no longer used directly, except in
TransactionIdIsActive().  But that routine is only used for an Assert()
now, so I think it's fine to just have GetRecentXmin() return the value
and not set RecentXmin as a side effect; my point is that ISTM
RecentXmin can be removed altogether, which makes that business simpler.


As far as GetOldestXmin() is concerned, that routine now ignores its
arguments.  Is that okay?  I imagine it's just a natural consequence of
how things work now.   [ ... reads some more code ... ]  Oh, now it's
only used to determine a freeze limit.  I think you should just remove
the arguments and make that whole thing simpler.  I was going to suggest
that AdvanceRecentGlobalXmin() could receive a possibly-NULL Relation
argument to pass down to GetOldestSnapshotGuts() which can make use of
it for a tigther limit, but since OldestXmin is only for freezing, maybe
there is no point in this extra complication.

Regarding the pendingGlobalXmin stuff, I didn't find any documentation.
I think you just need to write a very long comment on top of
AdvanceRecentGlobalXmin() to explain what it's doing.  After going over
the code half a dozen times I think I understand what's idea, and it
seems sound.

Not sure about the advancexmin_counter thing.  Maybe in some cases
sessions will only run 9 transactions or less and then finish, in which
case it will only get advanced at checkpoint time, which would be way
too seldom.  Maybe it would work to place the counter in shared memory:

acquire(spinlock);
if (ShmemVariableCache->counter++ >= some_number)
{
   SI don't understand this:hmemVariableCache->counter = 0;
   do_update = true;
}
release(spinlock);
if (do_update)
   AdvanceRecentGlobalXmin();
(Maybe we can forgo the spinlock altogether?  If the value gets out of
hand once in a while, it shouldn't really matter)  Perhaps the "some
number" should be scaled to some fraction of MaxBackends, but not less
than X (32?) and no more than Y (128?).  Or maybe just use constant 10,
as the current code does.  But it'd be total number of transactions, not
transaction in the current backend.


I think it'd be cleaner to have a distinct typedef to use when
XLogRecPtr is used as a snapshot representation.  Right now there is no
clarity on whether we're interested in the xlog position itself or it's
a snapshot value.


HeapTupleSatisfiesVacuum[X]() and various callers needs update to their
comments: when OldestXmin is mentioned, it should be OldestSnapshot.

I noticed that SubTransGetTopmostTransaction() is now only called from
predicate.c, and it's pretty constrained in what it wants.  I'm not
implying that we want to do anything in your patch about it, other than
perhaps add a note to it that we may want to examine it later for
possible changes.

I haven't gone over the transam.c, clog.c changes yet.

I attach a couple of minor tweaks to the README, mostly for clarity (but
also update clog -> csnlog in a couple of places).

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

diff --git a/src/backend/access/transam/README b/src/backend/access/transam/README
index d11c817..23479b6 100644
--- a/src/backend/access/transam/README
+++ b/src/backend/access/transam/README
@@ -247,24 +247,24 @@ What we actually enforce is strict serialization of commits and rollbacks
 with snapshot-taking. We use the LSNs generated by Write-Ahead-Logging as
 a convenient monotonically increasing counter, to serialize commits with
 snapshots. Each commit is naturally assigned an LSN; it's the LSN of the
-commit WAL record. Snapshots are also represented by an LSN; all commits
-with a commit record's LSN <= the snapshot's LSN are considered as visible
-to the snapshot. Acquiring a snapshot is a matter of reading the current
-WAL insert location.
+commit WAL record. Snapshots are also represented by an LSN.  All commits
+with a commit record's LSN less than or equal to the snapshot's LSN are
+considered to be visible to the snapshot. Acquiring a snapshot is a matter
+of reading the current WAL insert location.
 
 When checking the visibility of a tuple, we need to look up the commit LSN
 of the xmin/xmax. For that purpose, we store the commit LSN of each
-transaction in the commit log (clog). However, storing the LSN in the
-clog is not atomic with writing the WAL record: it's possible that
-another backend takes a snapshot right after a commit WAL record was inserted
-to WAL, but the clog hasn't been updated yet. To close that race condition,
-just before writing the commit WAL record, the committing backend sets the
-clog entry to a special value, COMMITLSN_COMMITTING. It is replaced with
-the commit record's LSN after the WAL record has been written. When a
-backend looks up a transaction's commit LSN in the clog and sees
-COMMITLSN_COMMITTING, it waits for the commit to finish, by calling
-XactLockTableWait(). That's quite heavy-weight, but the race should
-happen rarely.
+transaction in the commit sequence number log (csnlog). However, storing
+the LSN in the csnlog is not atomic with writing the WAL record: it's
+possible that another backend takes a snapshot right after a commit WAL
+record was inserted to WAL, but the csnlog hasn't been updated yet. To
+close that race condition, just before writing the commit WAL record, the
+committing backend sets the csnlog entry to a special value,
+COMMITLSN_COMMITTING. It is replaced with the commit record's LSN after
+the WAL record has been written.  When a backend looks up a transaction's
+commit LSN in the csnlog and sees COMMITLSN_COMMITTING, it waits for the
+commit to finish by calling XactLockTableWait(). That's quite
+heavy-weight, but the race should happen rarely.
 
 So, a snapshot is simply an LSN, such that all transactions that committed
 before that LSN are visible, and everything later is still considered as

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Proposal for CSN based snapshots

Reply via email to