On Wed, Oct 26, 2016 at 2:51 PM, Masahiko Sawada <sawada.m...@gmail.com> wrote:
> On Fri, Oct 21, 2016 at 2:38 PM, Ashutosh Bapat <ashutosh.ba...@enterprisedb.com> wrote:
>> On Wed, Oct 19, 2016 at 9:17 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Thu, Oct 13, 2016 at 7:27 AM, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
>>>> However, when I briefly read the description in "Transaction Management in the R* Distributed Database Management System (C. Mohan et al)" [2], it seems that what Ashutosh is saying might be a correct way to proceed after all:
>>>
>>> I think Ashutosh is mostly right, but I think there's a lot of room to doubt whether the design of this patch is good enough that we should adopt it.
>>>
>>> Consider two possible designs. In design #1, the leader performs the commit locally and then tries to send COMMIT PREPARED to every remote server afterward, and only then acknowledges the commit to the client. In design #2, the leader performs the commit locally and then acknowledges the commit to the client at once, leaving the task of running COMMIT PREPARED to some background process. Design #2 involves a race condition, because it's possible that the background process might not complete COMMIT PREPARED on every node before the user submits the next query, and that query might then fail to see supposedly-committed changes. This can't happen in design #1. On the other hand, there's always the possibility that the leader's session is forcibly killed, even perhaps by pulling the plug. If the background process contemplated by design #2 is well-designed, it can recover and finish sending COMMIT PREPARED to each relevant server after the next restart. In design #1, that background process doesn't necessarily exist, so inevitably there is a possibility of orphaning prepared transactions on the remote servers, which is not good. Even if the DBA notices them, it won't be easy to figure out whether to commit them or roll them back.
>>>
>>> I think this thought experiment shows that, on the one hand, there is a point to waiting for commits on the foreign servers, because it can avoid the anomaly of not seeing the effects of your own commits. On the other hand, it's ridiculous to suppose that every case can be handled by waiting, because that just isn't true. You can't be sure that you'll be able to wait long enough for COMMIT PREPARED to complete, and even if that works out, you may not want to wait indefinitely for a dead server. Waiting for a ROLLBACK PREPARED has no value whatsoever unless the system design is such that failing to wait for it results in the ROLLBACK PREPARED never getting performed -- which is a pretty poor excuse.
>>>
>>> Moreover, there are good reasons to think that doing this kind of cleanup work in the post-commit hooks is never going to be acceptable. Generally, the post-commit hooks need to be no-fail, because it's too late to throw an ERROR. But there's very little hope that a connection to a remote server can be no-fail; anything that involves a network connection is, by definition, prone to failure. We can try to guarantee that every single bit of code that runs in the path that sends COMMIT PREPARED only raises a WARNING or NOTICE rather than an ERROR, but that's going to be quite difficult to do: even palloc() can throw an error. And what about interrupts? We don't want to be stuck inside this code for a long time without any hope of the user recovering control of the session by pressing ^C, but of course the way that works is it throws an ERROR, which we can't handle here. We fixed a similar issue for synchronous replication in 9a56dc3389b9470031e9ef8e45c95a680982e01a by making an interrupt emit a WARNING in that case and then return control to the user. But if we do that here, all of the code that every FDW executes in this path has to be aware of that rule and follow it, and it just adds to the list of ways that the user backend can escape this code without having cleaned up all of the prepared transactions on the remote side.
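
[For concreteness, the two designs above differ only in when phase two runs relative to the reply to the client. A sketch of the per-server command sequence -- the GID format and server name are made up, and the remote needs max_prepared_transactions > 0:

    -- Phase one, sent to each foreign server during the local
    -- pre-commit step (identical in both designs):
    PREPARE TRANSACTION 'fdw_xact_1234_server_a';

    -- The leader then commits locally; that local commit is the
    -- decision point for the whole distributed transaction.

    -- Phase two, sent to each foreign server:
    COMMIT PREPARED 'fdw_xact_1234_server_a';
    -- Design #1: the session runs this itself before acknowledging
    -- the commit, so the client never observes a lagging remote.
    -- Design #2: a background process runs this after the client has
    -- already been told "committed", so an immediate follow-up query
    -- against server_a may not yet see the changes.
]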
>>
>> Hmm, IIRC, my patch, and possibly the patch by Masahiko-san and Vinayak, tries to resolve prepared transactions in post-commit code. I agree with you here that it should be avoided, and that a background worker should take over the job of resolving transactions.
>>
>>> It seems to me that the only way to really make this feature robust is to have a background worker as part of the equation. The background worker launches at startup and looks around for local state that tells it whether there are any COMMIT PREPARED or ROLLBACK PREPARED operations pending that weren't completed during the last server lifetime, whether because of a local crash or remote unavailability. It attempts to complete those and retries periodically. When a new transaction needs this type of coordination, it adds the necessary crash-proof state and then signals the background worker. If appropriate, it can wait for the background worker to complete, just like a CHECKPOINT waits for the checkpointer to finish -- but if the CHECKPOINT command is interrupted, the actual checkpoint is unaffected.
>>
>> My patch, and hence the patch by Masahiko-san and Vinayak, has the background worker in the equation. The background worker tries to resolve prepared transactions on the foreign server periodically. IIRC, sending it a signal when another backend creates foreign prepared transactions is not implemented. That may be a good addition.
>>
>>> More broadly, the question has been raised as to whether it's right to try to handle atomic commit and atomic visibility as two separate problems. The XTM API proposed by Postgres Pro aims to address both with a single stroke. I don't think that API was well-designed, but maybe the idea is good even if the code is not. Generally, there are two ways in which you could imagine that a distributed version of PostgreSQL might work. One possibility is that one node makes everything work by going around and giving instructions to the other nodes, which are more or less unaware that they are part of a cluster. That is basically the design of Postgres-XC, and certainly the design being proposed here. The other possibility is that the nodes are actually clustered in some way and agree on things like whether a transaction committed or what snapshot is current using some kind of consensus protocol. It is obviously possible to get a fairly long way using the first approach, but it seems likely that the second one is fundamentally more powerful: among other things, because the first approach is so centralized, the leader is apt to become a bottleneck. And, quite apart from that, can a centralized architecture with the leader manipulating the other workers ever allow for atomic visibility? If atomic visibility can build on top of atomic commit, then it makes sense to do atomic commit first, but if we build this infrastructure and then find that we need an altogether different solution for atomic visibility, that will be unfortunate.
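
[To illustrate the recovery half of that background-worker idea: after a restart, the worker walks its crash-proof local state and finishes whatever phase two was left incomplete on each participant. A sketch of what it would issue against a remote server -- pg_prepared_xacts is the existing catalog view, while the GID format is made up:

    -- See which of our earlier PREPARE TRANSACTIONs are still
    -- pending on this remote:
    SELECT gid FROM pg_prepared_xacts WHERE gid LIKE 'fdw_xact_%';

    -- Then finish phase two according to the locally recorded
    -- outcome of the coordinator's transaction:
    COMMIT PREPARED 'fdw_xact_1234_server_a';
    -- or, if the local transaction had aborted:
    -- ROLLBACK PREPARED 'fdw_xact_1234_server_a';
]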
>>
>> There are two problems to solve as far as visibility is concerned: 1. Consistency: which transactions' changes are visible to a given transaction. 2. Making the changes by all the segments of a given distributed transaction on different foreign servers visible at the same time; IOW, no other transaction sees the changes of only a few segments without seeing the changes of all of them.
>>
>> The first problem is hard to solve and there are many consistency semantics -- a large topic of discussion in itself.
>>
>> The second problem can be solved on top of this infrastructure by extending the PREPARE TRANSACTION API. I am writing down my ideas so that they don't get lost; it's not a completed design.
>>
>> Assume that we have syntax which identifies the originating server that prepared the transaction: PREPARE TRANSACTION <GID> FOR SERVER <local server name> WITH ID <xid>, where xid is the transaction identifier on the local server. Or we may incorporate that information in the GID itself, with the foreign server knowing how to decode it.
>>
>> Once we have that information, the foreign server can actively poll the local server to get the status of transaction xid and resolve the prepared transaction itself. It can go a step further and inform the local server that it has resolved the transaction, so that the local server can purge it from its own state. It can also remember the fate of xid, which another foreign server can consult if the local server is down. If another transaction on the foreign server stumbles on a transaction prepared (but not resolved) by the local server, the foreign server has two options: 1. consult the local server and resolve it; 2. if the first option fails to get the status of xid, or is not workable, throw an error, e.g. "in-doubt transaction". There is probably more network traffic happening here, but usually the local server should be able to resolve the transaction before any other transaction stumbles upon it, so the overhead is incurred only when necessary.
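
[A sketch of how that might look on the wire; the FOR SERVER ... WITH ID clause is hypothetical (no such syntax exists today), and the status check is only a stand-in for whatever call the foreign server would use against the originating server:

    -- Hypothetical extended syntax carrying the origin and its xid:
    PREPARE TRANSACTION 'gid_1234' FOR SERVER 'node_local' WITH ID 1234;

    -- If the foreign server later stumbles on this unresolved
    -- transaction, it asks node_local for the fate of xid 1234
    -- (e.g. something like txid_status(1234), or new infrastructure)
    -- and then resolves it on its own:
    COMMIT PREPARED 'gid_1234';      -- if the origin says committed
    -- ROLLBACK PREPARED 'gid_1234'; -- if the origin says aborted
]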
>
> I think we can consider atomic commit and atomic visibility separately, and atomic visibility can build on top of atomic commit. We can't provide atomic visibility across multiple nodes without consistent updates, so I'd like to focus on atomic commit in this thread. For atomic commit, the two-phase commit protocol is the perfect solution, and whatever solution for atomic visibility we end up with, atomic commit by 2PC is a necessary feature. We can consider an atomic commit feature with the following functionalities:
> * The local node is responsible for transaction management among the relevant remote servers using 2PC.
> * The local node keeps information about the state of each distributed transaction.
> * There is a process that resolves in-doubt transactions.
>
> As Ashutosh mentioned, the current patch supports almost all of these functionalities. But I'm trying to update it so that it can store the information for multiple foreign servers in one FDWXact file, with one entry in shared memory. As it stands, even though a new remote server can be added on the fly, we could need to restart the local server to allocate a larger shared buffer for FDW transactions whenever a remote server is added. I'm also incorporating the other comments.
>
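
[Putting that list together, the end-to-end flow on the local node looks roughly like this for a transaction touching two foreign servers. This is only a sketch -- the server names and GIDs are invented, and the PREPARE/COMMIT PREPARED commands are what the FDW connections would carry:

    BEGIN;
    -- ... writes are sent to server_a and server_b over their FDW
    -- connections ...

    -- Pre-commit: record crash-proof state locally, then send to
    -- each participant:
    --   on server_a: PREPARE TRANSACTION 'fdw_xact_1234_server_a';
    --   on server_b: PREPARE TRANSACTION 'fdw_xact_1234_server_b';

    COMMIT;  -- the local commit decides the distributed outcome

    -- Afterwards, inline or from the resolver process, on each
    -- participant:
    --   COMMIT PREPARED 'fdw_xact_1234_server_a';
    --   COMMIT PREPARED 'fdw_xact_1234_server_b';
    -- retried until both succeed, so no prepared xact is orphaned.
]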
The subject line was removed by mistake; please ignore that. Sorry for the noise.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center