Hi,

There's been a lot of discussion over the past month or so, and it's become difficult to get a good idea of the current state - what issues remain to be solved, what's unrelated to this patch, and how to move it forward. Long-running threads tend to be confusing, so I had a short call with Amit yesterday to discuss the current state and to make sure we're on the same page. I believe it was very helpful, and I promised to post a short summary of the call - the issues, what we agreed seems like a path forward, and so on.
Obviously, I might have misunderstood something, in which case Amit can correct me. And I'd certainly welcome opinions from others.

In general, we discussed three areas - desirability of the feature, correctness and performance. I believe a brief summary of the agreement would be this:

- desirability of the feature: Random IDs (UUIDs etc.) are likely a much better solution for distributed (esp. active-active) systems. But there are important use cases that are likely to keep using regular sequences (online upgrades of single-node instances, existing systems, ...).

- correctness: There's one possible correctness issue, when the snapshot changes to FULL between the record creating a sequence relfilenode and that sequence advancing. This needs to be verified/reproduced, and fixed.

- performance issues: We've agreed the case with a lot of aborts (when DecodeCommit consumes a lot of CPU) is unrelated to this patch. We've discussed whether the overhead with many sequence changes (nextval-40) is acceptable, and/or how to improve it.

Next, I'll go over these points in more detail, with my understanding of what the challenges are, possible solutions, etc. Most of this was discussed/agreed on the call, but some of it is ideas I had only after the call, while writing this summary.

1) desirability of the feature

Firstly, do we actually want/need this feature? I believe that's very much a question of what use cases we're targeting.

If we only focus on distributed databases (particularly those with multiple active nodes), then we probably agree that the right solution is not to use sequences (~generators of incrementing values) but UUIDs or similar random identifiers (better not to call them sequences, since there's not much sequential about them). The huge advantage is that this does not require replicating any state between the nodes, so logical decoding can simply ignore them and replicate just the generated values. I don't think there's any argument about that. If I were building such a distributed system, I'd certainly use random IDs.

The question is what to do about the other use cases - online upgrades relying on logical decoding, failovers to logical replicas, and so on. Or what to do about existing systems that can't easily be changed to use different/random identifiers. Those are not really distributed systems and therefore don't quite need random IDs. Furthermore, it's not like random IDs have no drawbacks - UUIDv4 can easily lead to massive write amplification, for example. There are variants like UUIDv7 that reduce the impact, but they come with other trade-offs.

My takeaway from this is that there's still value in having this feature.

2) correctness

The only correctness issue I'm aware of is the question of what happens when the snapshot switches to SNAPBUILD_FULL_SNAPSHOT between decoding the relfilenode creation and the sequence increment, pointed out by Dilip in [1]. If this happens (and while I don't have a reproducer, I also don't have a very clear idea why it couldn't happen), it breaks how the patch decides between transactional and non-transactional sequence changes. So this seems like a fatal flaw - it definitely needs to be solved.

I don't have a good idea how to do that, unfortunately. The problem is the dependency on an earlier record, and that this needs to be evaluated immediately (in the decode phase). Logical messages don't have the same issue because the "transactional" flag does not depend on earlier records, and other records are not interpreted until apply/commit, when we know everything relevant was decoded.
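To make the dependency on the earlier record a bit more concrete, here's a tiny standalone model in C. To be clear, this is not the patch's code - all the names are invented, and the real patch tracks this state in the reorder buffer - it only shows why losing the creation record while the snapshot is not yet FULL makes a later increment of the same sequence look non-transactional:

/*
 * Standalone model of the problem (not PostgreSQL code - all names are
 * invented).  A "create" record registers the new relfilenode for the
 * creating transaction; a later increment is transactional only if that
 * registration is found.  If the create record is consumed while the
 * snapshot is not yet FULL, nothing gets registered and the increment is
 * silently misclassified as non-transactional.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_TRACKED 16

typedef struct
{
	unsigned	filenode;
	unsigned	xid;
} TrackedCreate;

static TrackedCreate tracked[MAX_TRACKED];
static int	ntracked = 0;

/* decoding a relfilenode-creation record */
static void
track_create(unsigned filenode, unsigned xid, bool snapshot_is_full)
{
	if (!snapshot_is_full || ntracked >= MAX_TRACKED)
		return;					/* record skipped => nothing remembered */

	tracked[ntracked].filenode = filenode;
	tracked[ntracked].xid = xid;
	ntracked++;
}

/* decoding a sequence-increment record */
static bool
increment_is_transactional(unsigned filenode, unsigned xid)
{
	for (int i = 0; i < ntracked; i++)
	{
		if (tracked[i].filenode == filenode && tracked[i].xid == xid)
			return true;
	}
	return false;
}

int
main(void)
{
	/* xid 100 creates a sequence before the snapshot reaches FULL */
	track_create(1234, 100, false);

	/* snapshot switches to FULL, the same xact then advances the sequence */
	printf("transactional = %d\n", increment_is_transactional(1234, 100));

	/* prints 0 - the change is wrongly treated as non-transactional */
	return 0;
}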
I don't know what the solution is. Either we find a way to make sure not to lose/skip the smgr record, or we need to rethink how we determine the transactional flag (perhaps even try again adding it to the WAL record, although we didn't find a way to do that earlier).

3) performance issues

We have discussed two cases - "ddl-abort" and "nextval-40".

The "ddl-abort" case is when the workload does a lot of DDL and then aborts it, leading to profiles dominated by DecodeCommit. The agreement here is that while this is a valid issue and we should try fixing it, it's unrelated to this patch - the issue exists even on master. So in the context of this patch we can ignore it.

The "nextval-40" case applies to workloads doing a lot of regular sequence changes. We only decode/apply changes written to WAL, and that happens only for every 32 increments or so. The test used a very simple transaction (just a sequence advance that writes WAL + a 1-row insert), which means it's pretty much a worst-case impact. For larger transactions, it's going to be hardly measurable. Also, this only measured decoding, not apply (which will also make it less significant).

Most of the overhead comes from ReorderBufferQueueSequence() starting and then aborting a transaction, per the profile in [2]. This only happens in the non-transactional case, but we expect that to be the common case in regular workloads.

Anyway, let's say we want to mitigate this overhead. I think there are three ways to do that:

a) find a way to not have to apply sequence changes immediately, but queue them until the next commit

This would give us a chance to combine multiple sequence changes into a single "replay change", reducing the overhead. There are a couple of problems with this, though. Firstly, it can't help OLTP workloads, because the transactions are short, so sequence changes are unlikely to combine. It's also not clear how expensive this would be - could it be expensive enough to outweigh the benefits? All of this assumes it can be implemented - we don't have such a patch yet. I was speculating about something like this earlier, but I haven't managed to make it work. Doesn't mean it's impossible, ofc.

b) provide a way for the output plugin to skip sequence decoding early

The way the decoding is coded now, ReorderBufferQueueSequence does all the expensive dance even if the output plugin does not implement the sequence callbacks. Maybe we should have a way to allow skipping all of this early, right at the beginning of ReorderBufferQueueSequence (and thus before we even try to start/abort the transaction). There's a rough sketch of this idea at the end of this section.

Of course, this is not a perfect solution either - it won't help workloads that actually need/want sequence decoding but where the decoding has significant overhead, or plugins that choose to support decoding sequences in general. For example, the built-in output plugin would certainly support sequences - and the overhead would still be there (even if no sequences are added to the publication).

c) instruct people to increase the sequence cache from 32 to 1024

This would reduce the number of WAL records that need to be decoded and replayed, reducing the overhead proportionally. Of course, it also means the sequence will "jump forward" more in case of a crash or a failover to the logical replica, but I think that's an acceptable tradeoff - people should not expect sequences to be gap-less anyway. Considering nextval-40 is pretty much worst-case behavior, I think this might actually be an acceptable solution/workaround.
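To illustrate what I mean by the early exit in (b), here's a minimal standalone sketch. Again, this is not the actual code - the callback table and the function names are made up, and the real ReorderBufferQueueSequence obviously looks quite different - it only shows the shape of the check I have in mind:

/*
 * Minimal sketch of the early exit (not PostgreSQL code - the names only
 * mimic the shape of a logical decoding callback table).
 */
#include <stddef.h>
#include <stdio.h>

typedef struct SketchCallbacks
{
	void		(*change_cb) (const char *data);
	void		(*sequence_cb) (const char *seqname, long value);	/* may be NULL */
} SketchCallbacks;

static void
sketch_queue_sequence(const SketchCallbacks *cb, const char *seqname, long value)
{
	/*
	 * Cheap test first: if the plugin never registered a sequence callback,
	 * return before doing any of the expensive work (building a snapshot,
	 * starting and aborting an internal transaction, ...).
	 */
	if (cb->sequence_cb == NULL)
		return;

	/* expensive path: snapshot + transaction start/abort + callback */
	cb->sequence_cb(seqname, value);
}

static void
print_change(const char *data)
{
	printf("change: %s\n", data);
}

static void
print_sequence(const char *seqname, long value)
{
	printf("sequence %s advanced to %ld\n", seqname, value);
}

int
main(void)
{
	SketchCallbacks without_seq = {print_change, NULL};
	SketchCallbacks with_seq = {print_change, print_sequence};

	sketch_queue_sequence(&without_seq, "s", 12345);	/* cheap no-op */
	sketch_queue_sequence(&with_seq, "s", 12345);		/* invokes the callback */
	return 0;
}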
regards

[1] https://www.postgresql.org/message-id/CAFiTN-vAx-Y%2B19ROKOcWnGf7ix2VOTUebpzteaGw9XQyCAeK6g%40mail.gmail.com
[2] https://www.postgresql.org/message-id/0bc34f71-7745-dc16-d765-5ba1f0776a3f%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company