On Tue, Sep 20, 2016 at 11:13 PM, Stas Kelvich <s.kelv...@postgrespro.ru> wrote: >> On 20 Sep 2016, at 09:40, Michael Paquier <michael.paqu...@gmail.com> wrote: >> + StartupCLOG(); >> + StartupSUBTRANS(checkPoint.oldestActiveXid); >> + RecoverPreparedFromFiles(); >> [...] >> /* >> - * Startup commit log and subtrans only. MultiXact and commit >> - * timestamp have already been started up and other SLRUs are not >> - * maintained during recovery and need not be started yet. >> - */ >> - StartupCLOG(); >> - StartupSUBTRANS(oldestActiveXID); >> Something is definitely wrong in this patch if we begin to do that. > > Putting that before actual WAL replay is just following historical order of > events. > Prepared files are pieces of WAL that happened before checkpoint, so ISTM > there is no conceptual problem in restoring their state before replay.
For wal_level = minimal there is no need to call that to begin with.. And I'd think that it is better to continue with things the same way. >> I have been thinking about this patch quite a bit, and the approach >> taken looks really over-complicated to me. Basically what we are >> trying to do here is to reuse as much as possible code paths that are >> being used by non-recovery code paths during recovery, and then try to >> adapt a bunch of things like SUBTRANS structures, CLOG, XID handling >> and PGXACT things in sync to handle the 2PC information in memory >> correctly. I am getting to think that this is *really* bug-prone in >> the future and really going to make maintenance hard. > > With this patch we are reusing one infrastructure for normal work, recovery > and replay. The mix between normal work and recovery is the scary part. Normal work and recovery are entirely two different things. >> Taking one step back, and I know that I am a noisy one, but looking at >> this patch is making me aware of the fact that it is trying more and >> more to patch things around the more we dig into issues, and I'd >> suspect that trying to adapt for example sub-transaction and clog >> handling is just the tip of the iceberg and that there are more things >> that need to be taken care of if we continue to move on with this >> approach. > > Again, it isn’t patching around to fix issues, it’s totally possible to keep > old interface. > However it’s possible that current approach is wrong because of some aspect > that i didn’t think of, but now I don’t see good counterarguments. Any missing points could be costly here. If we have a way to make things with the same performance >> Coming to the point: why don't we simplify things? In short: > >> - Just use a static list and allocate a chunk of shared memory as >> needed. DSM could come into play here to adapt to the size of a 2PC >> status file, this way there is no need to allocate a chunk of memory >> with an upper bound. Still we could use an hardcoded upper-bound of >> say 2k with max_prepared_transactions, and just write the 2PC status >> file to disk if its size is higher than what can stick into memory. >> - save 2PC information in memory instead of writing it to a 2PC file >> when XLOG_XACT_PREPARE shows up. >> - flush information to disk when there is a valid restart point in >> RecoveryRestartPoint(). We would need as well to tweak >> PrescanPreparedTransactions accordingly than everything we are trying >> to do here. >> That would be way more simple than what's proposed here, and we'd >> still keep all the performance benefits. > > So file arrives to replica through WAL and instead of directly reading it you > suggest > to copy it do DSM if it is small enough, and to filesystem if not. Probably > that will > allow us to stay only around reading/writing files, but: > > * That will be measurably slower than proposed approach because of unnecessary > copying between WAL and DSM. Not to mention prepare records of arbitrary > length. > * Different behaviour on replica and master — less tested code for replica. Well, the existing code on HEAD is battery-tested as well. This reduces the likeliness of bugs at replay with new features. > * Almost the same behaviour can be achieved by delaying fsync on 2pc files > till > checkpoint. Oh, that's an idea here. It may be interesting to see if this meets your performance goals... And that could result in a far smaller patch. > But if reword you comment in a way that it is better to avoid replaying > prepare record at all, > like it happens now in master, then I see the point. And that is possible > even without DSM, it possible > to allocate static sized array storing some info about tx, whether it is in > the WAL or in file, xid, gid. > Some sort of PGXACT doppelganger only for replay purposes instead of using > normal one. That's exactly what I meant. The easiest approach is just to allocate a bunk of shared memory made of 2PC_STATUS_SIZE * max_prepared_xacts with 2PC_STATUS_SIZE set up at an arbitrary size that we find appropriate to store the information of the file. DSM may be useful to take care of the case where the status file is bigger than one slot, but with a sufficiently wise upper bound, we can get away without it. > So taking into account my comments what do you think? Should we keep current > approach > or try to avoid replaying prepare records at all? I'd really like to give a try to the idea you just mentioned, aka delay the fsync of the 2PC files from RecreateTwoPhaseFile to CreateRestartPoint, or get things into memory.. I could write one, or both of those patches. I don't mind. -- Michael -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers