On Fri, Dec 8, 2017 at 3:36 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Wed, Dec 6, 2017 at 11:26 PM, Andres Freund <and...@anarazel.de> wrote:
>>> Either way, it is not clear to me how we will keep it updated after
>>> recovery.  Right now the mechanism is quite simple: at the beginning of
>>> recovery we take the value of nextXid from the checkpoint record, and
>>> then if any xlog record indicates an xid that follows nextXid, we
>>> advance it.  The point to note here is that we take the xid from the
>>> WAL record (which means it assumes xids can be non-contiguous, or that
>>> some xids are consumed without being logged) and increment it.  Unless
>>> we plan to change something in that logic (like storing 64-bit xids in
>>> WAL records), it is not clear to me how to make it work.  OTOH,
>>> recovering the value of the epoch, which increments only at wraparound,
>>> seems fairly straightforward, as described in my previous email.
>>
>> I think it should be fairly simple if we simply add the 64-bit xid to
>> the existing clog extension WAL records.
>
> IIUC, you mean that we should log the 64-bit xid value in the
> CLOG_ZEROPAGE record while extending clog, and that we need to do so
> only at wraparound.  Doing it every time probably doesn't hurt either,
> but I think doing it at wraparound should be sufficient.
Can you please elaborate on why clog redo in particular would need to use
64-bit xids?  Isn't the 64-bit xid derivable from the 32-bit xid in the
WAL plus the current value of a new 64-bit next xid?

> Just to be clear, I am not planning to pursue writing a patch for this
> at the moment.  So, if anybody else is interested or if Andres wants to
> write it, I can help with the review.

I played around with this idea yesterday.  Experiment-grade patch
attached.  Approach:

1.  Introduce a new type BigTransactionId (better names welcome).

2.  Change ShmemVariableCache->nextXid to ShmemVariableCache->nextBigXid.

3.  Change checkpoints to use nextBigXid too.

4.  Change ReadNewTransactionId() to ReadNextBigTransactionId().

5.  Remove GetNextXidAndEpoch(), as it's now redundant.

6.  Everywhere that was reading ShmemVariableCache->nextXid or calling
    ReadNewTransactionId() but actually needs an xid now uses an explicit
    conversion macro XidFromBigTransactionId().

7.  Everywhere that was writing ShmemVariableCache->nextXid but only has
    an xid now goes through a new function
    AdvanceNextBigTransactionIdPast(xid), to handle recovery (since WAL
    records carry xids, as mentioned by Amit).

I think it's probably a good idea to make it very explicit when moving
between big and small transaction IDs, hence the inclusion of the word
'big' in variable and function names and the use of a function-like macro
(rather than implicit conversion, which C doesn't give me a good way to
prevent).  Otherwise there is a class of bug that stays hidden for the
first 2^32 transactions.

The logic in CreateCheckPoint() that assumed there could be only one
wraparound per checkpoint is gone (the problem reported in this thread).
Note that AdvanceNextBigTransactionIdPast() contains logic that infers an
epoch, which is in some ways similar to what the removed code was doing,
but in this case I think all callers must have an xid from the same or
the next epoch.

I'm probably missing a few details... this is only a first swing at the
problem.  It does pass check-world and the various xid wraparound tests I
came up with.  Clearly big xids could spread to more places than I show
here.

Do you see problems with this approach, or do you have better ideas?

--
Thomas Munro
http://www.enterprisedb.com
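To make the conversion and epoch-inference ideas concrete, here is a rough
standalone sketch, not taken from the attached patch, of how
XidFromBigTransactionId() and AdvanceNextBigTransactionIdPast() could
behave.  The names BigTransactionId, XidFromBigTransactionId and
AdvanceNextBigTransactionIdPast come from the list above; the stdint
typedefs, the file-scope nextBigXid variable (standing in for
ShmemVariableCache->nextBigXid), and the details of the wraparound check
are illustrative assumptions, and locking and the skipping of the special
xids after a wrap are omitted.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;		/* 32-bit xid, as in WAL and on disk */
typedef uint64_t BigTransactionId;	/* epoch in high 32 bits, xid in low */

/* Keep only the low 32 bits when a caller really needs a short xid. */
#define XidFromBigTransactionId(bigxid) ((TransactionId) (bigxid))

/* Stand-in for ShmemVariableCache->nextBigXid (locking omitted). */
static BigTransactionId nextBigXid;

/*
 * Make sure nextBigXid is past the given 32-bit xid, assuming the xid
 * belongs to the same epoch as nextBigXid or to the one immediately
 * following it.  Recovery callers that only have an xid from a WAL record
 * would go through here instead of writing nextXid directly.
 */
static void
AdvanceNextBigTransactionIdPast(TransactionId xid)
{
	TransactionId low = XidFromBigTransactionId(nextBigXid);
	uint64_t	epoch = nextBigXid >> 32;

	/* Modulo-2^32 comparison: does xid follow or equal our position? */
	if ((int32_t) (xid - low) >= 0)
	{
		/* It's ahead of us; if numerically smaller, it wrapped around. */
		if (xid < low)
			epoch++;
		/* The next xid to hand out is the one after the xid we saw. */
		nextBigXid = ((epoch << 32) | xid) + 1;
	}
}

int
main(void)
{
	/* Pretend we're replaying WAL with nextBigXid in epoch 5, near a wrap. */
	nextBigXid = ((uint64_t) 5 << 32) | 0xFFFFFFF0;

	AdvanceNextBigTransactionIdPast(0xFFFFFFFA);	/* same epoch */
	AdvanceNextBigTransactionIdPast(0x00000010);	/* wrapped: epoch 6 */

	assert(nextBigXid == (((uint64_t) 6 << 32) | 0x11));
	printf("nextBigXid = %u:%u\n",
		   (unsigned) (nextBigXid >> 32),
		   (unsigned) XidFromBigTransactionId(nextBigXid));
	return 0;
}

The modulo-2^32 test is the same style of comparison the existing 32-bit
recovery code uses to decide whether an xid seen in WAL is ahead of
nextXid; the only new step is bumping the epoch when the xid is ahead in
circular order but numerically smaller.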
0001-Track-the-next-xid-using-64-bits.patch
Description: Binary data