On Tue, Nov 10, 2015 at 3:46 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sun, Nov 8, 2015 at 6:35 PM, Michael Paquier
> <michael.paqu...@gmail.com> wrote:
>> Sure. Now imagine that the pg_twophase entry is corrupted for this
>> transaction on one node. This would trigger a PANIC on it, and
>> transaction would not be committed everywhere.
>
> If the database is corrupted, there's no way to guarantee that
> anything works as planned. This is like saying that criticizing
> somebody's disaster recovery plan on the basis that it will be
> inadequate if the entire planet earth is destroyed.

There could just as well be FS, OS, or network problems... To come back
to the point: I was simply surprised by Konstantin's statement upthread
that if the commit fails on some of the nodes, we should roll back the
prepared transaction on all nodes. In the example given, in the phase
after calling dtm_end_prepare, say we perform COMMIT PREPARED correctly
on node 1, but it then fails on node 2 because a meteor has hit the
server. It seems that we cannot roll back at that point; instead we had
better roll in a backup and make sure that the transaction gets
committed. How would you roll back the transaction already committed on
node 1? But perhaps I missed something...

>> I am aware of the fact
>> that by definition PREPARE TRANSACTION ensures that a transaction will
>> be committed with COMMIT PREPARED, just trying to see any corner cases
>> with the approach proposed. The DTM approach is actually rather close
>> to what a GTM in Postgres-XC does :)
>
> Yes. I think that we should try to learn as much as possible from the
> XC experience, but that doesn't mean we should incorporate XC's fuzzy
> thinking about 2PC into PG. We should not.

Yep.

> One point I'd like to mention is that it's absolutely critical to
> design this in a way that minimizes network roundtrips without
> compromising correctness. XC's GTM proxy suggests that they failed to
> do that. I think we really need to look at what's going to be on the
> other sides of the proposed APIs and think about whether it's going to
> be possible to have a strong local caching layer that keeps network
> roundtrips to a minimum. We should consider whether the need for such
> a caching layer has any impact on what the APIs should look like.

At that time, the number of round trips needed, particularly for READ
COMMITTED transactions that need a new snapshot for each query, was
really a performance killer.
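
To put rough numbers on the snapshot traffic, here is a minimal sketch in
Python. It is not XC or DTM code: the function name and the workload
figures are made up for illustration. It only assumes that READ COMMITTED
takes a new snapshot for every statement, that REPEATABLE READ takes one
snapshot per transaction, and that each snapshot costs one round trip to a
central snapshot service (no proxy or local cache in between).

```python
def snapshot_round_trips(transactions, statements_per_txn, isolation):
    """Count snapshot requests sent to a central snapshot service.

    Hypothetical model: one network round trip per snapshot taken,
    with no caching or proxying in between.
    """
    if isolation == "read committed":
        # A new snapshot is taken for every statement.
        return transactions * statements_per_txn
    if isolation == "repeatable read":
        # A single snapshot is taken for the whole transaction.
        return transactions
    raise ValueError("unsupported isolation level: %r" % (isolation,))


if __name__ == "__main__":
    # Illustrative workload only: 1000 transactions of 10 statements each.
    for level in ("read committed", "repeatable read"):
        print(level, snapshot_round_trips(1000, 10, level))
```

With these made-up figures the READ COMMITTED workload sends ten times as
many snapshot requests as the REPEATABLE READ one, which is the kind of
gap a local caching layer would have to absorb.
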
We used DBT-1 (TPC-W), which is less OLTP-like than DBT-2 (TPC-C); even
with DBT-1 the scalability limit was quickly reached with 10-20 nodes.

> For example, consider a 10-node cluster where each node has 32 cores
> and 32 clients, and each client is running lots of short-running SQL
> statements. The demand for snapshots will be intense. If every
> backend separately requests a snapshot for every SQL statement from
> the coordinator, that's probably going to be terrible. We can make it
> the problem of the stuff behind the DTM API to figure out a way to
> avoid that, but maybe that's going to result in every DTM needing to
> solve the same problems.

This recalls a couple of things, though in 2009 I was not playing with
servers of this scale.

> Another thing that I think we need to consider is fault-tolerance.
> For example, suppose that transaction commit and snapshot coordination
> services are being provided by a central server which keeps track of
> the global commit ordering. When that server gets hit by a freak bold
> of lightning and melted into a heap of slag, somebody else needs to
> take over. Again, this would live below the API proposed here, but I
> think it really deserves some thought before we get too far down the
> path. XC didn't start thinking about how to add fault-tolerance until
> quite late in the project, I think, and the same could be said of
> PostgreSQL itself: some newer systems have easier-to-use fault
> tolerance mechanisms because it was designed in from the beginning.
> Distributed systems by nature need high availability to a far greater
> degree than single systems, because when there are more nodes, node
> failures are more frequent.

Your memories on the matter are right. In the case of XC, the SPOF that
is GTM was made somewhat more stable by the addition of a GTM standby,
though that did not solve the scalability limit of having a centralized
snapshot facility.
It actually increased the load on the whole system because for short
transactions. Alea jacta est.
--
Michael

--
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers