On Mon, Jan 23, 2017 at 9:00 PM, Nikhil Sontakke <nikh...@2ndquadrant.com> wrote:
> Hi Stas,
>
>> So, here is a brand new implementation of the same thing.
>>
>> Now, instead of creating a pgproc entry for each prepared transaction
>> during recovery, I just store the recptr/xid correspondence in a
>> separate dlist and delete entries from that list when the redo process
>> faces a commit/abort. On checkpoint or at the end of recovery, the
>> transactions remaining in that list are dumped to files in pg_twophase.
>>
>> The current approach seems much simpler, and the patch has half the
>> LOCs of the previous one.
>
> The proposed approach and patch do appear to be much simpler than
> the previous one.
>
> I have some comments/questions on the patch (twophase_recovery_list_2.diff):
>
> 1)
> +	/*
> +	 * Move prepared transactions from KnownPreparedList to files, if any.
> +	 * It is possible to skip that step and teach subsequent code about
> +	 * KnownPreparedList, but the whole PrescanPreparedTransactions() happens
> +	 * once at the end of recovery or promotion, so it probably isn't worth
> +	 * the complications.
> +	 */
> +	KnownPreparedRecreateFiles(InvalidXLogRecPtr);
>
> Speeding up recovery or failover activity via a faster promote is a
> desirable thing. So, maybe, we should look at teaching the relevant
> code about using "KnownPreparedList"? I know that would increase the
> size of this patch and would mean more testing, but this seems to be
> the last remaining optimization in this code path.
That's a good idea, worth having in this patch. Actually, we may not
want to call KnownPreparedRecreateFiles() here, as promotion has not
been a synonym of an end-of-recovery checkpoint for a couple of
releases now.

> 3) We are pushing the fsyncing of 2PC files to the checkpoint replay
> activity with this patch. Now, typically, we would assume that PREPARE
> followed by COMMIT/ABORT would happen within a checkpoint replay
> cycle; if not, this would make checkpoint replays slower since the
> earlier spread-out fsyncs are now getting bunched up. I had concerns
> about this, but clearly your latest performance measurements don't
> show any negatives around this directly.

Most of the 2PC syncs just won't happen: such transactions normally
don't last long, and the number you would get during a checkpoint is
far lower than what would happen between two checkpoints. When working
on Postgres-XC, the number of 2PC transactions was really more funky.

> 6) Do we need to hold TwoPhaseStateLock in shared mode while calling
> KnownPreparedRecreateFiles()? I think it does not matter in the
> recovery/replay code path.

It does not matter, as the only code path tracking that would be the
checkpointer in CheckPointTwoPhase(), and this 2PC recovery patch no
longer touches TwoPhaseStateData. Actually, the patch makes no use of
it.

> 7) I fixed some typos and comments. PFA, patch attached.
>
> Other than this, I ran the TAP tests and they succeed as needed.
>
> On 23 January 2017 at 16:56, Stas Kelvich <s.kelv...@postgrespro.ru> wrote:
>> I did the tests with the following setup:
>>
>> * Start postgres initialized with pgbench
>> * Start pg_receivexlog
>> * Take a base backup
>> * Perform 1.5M transactions
>> * Stop everything and apply the WAL files stored by pg_receivexlog to
>>   the base backup.
>>
>> All tests were performed on a laptop with an NVMe SSD.
>>
>> number of transactions: 1.5M
>> start segment: 0x4
>>
>> -master non-2pc:
>> last segment: 0x1b
>> average recovery time per 16 wal files: 11.8s
>> average total recovery time: 17.0s
>>
>> -master 2pc:
>> last segment: 0x44
>> average recovery time per 16 wal files: 142s
>> average total recovery time: 568s
>>
>> -patched 2pc (previous patch):
>> last segment: 0x44
>> average recovery time per 16 wal files: 5.3s
>> average total recovery time: 21.2s
>>
>> -patched2 2pc (dlist_push_tail changed to dlist_push_head):
>> last segment: 0x44
>> average recovery time per 16 wal files: 5.2s
>> average total recovery time: 20.8s

The difference between those last two is likely noise.

By the way, in those measurements the OS cache is still filled with the
past WAL segments, which is rather a best case, no? What happens if you
do the same kind of tests on a box where memory is busy doing something
else and replayed WAL segments get evicted from the OS cache more
aggressively once the startup process switches to a new segment? This
could be tested for example on a VM with little memory (say 386MB or
less), so that the startup process, needing to access the past WAL
segments again to recover the 2PC information, has to get them back
directly from disk... One trick that you could use here would be to
tweak the startup process so that it drops the OS cache once a segment
is finished replaying, and see the effects of an aggressive OS cache
eviction.

This patch is showing really nice improvements with the OS cache
backing up the data; still, it would make sense to test things with a
worse test case and see if things could be done better. Today the
startup process reads records only sequentially; random access is a
concept that this patch introduces. Anyway, perhaps this does not
matter much: the non-recovery code path does the same thing as this
patch, and the improvement is too big to be ignored.
So, for consistency's sake, we could go with the proposed approach,
which has the advantage of not putting any restriction on the size of
the 2PC files, contrary to what an implementation saving the contents
of the 2PC files into memory would need to do.

>> So skipping unnecessary fsyncs gave us a 25x speed increase even on an
>> SSD (on an HDD the difference should be bigger).
>> Pushing to the list's head didn't yield measurable results, but anyway
>> it seems to be conceptually better.

That's in line with more recent segments getting priority as the
startup process reads the records sequentially, so it is definitely
better.
--
Michael


--
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers