On Wed, Feb 04, 2015 at 04:30:32PM +0000, Matthew Booth wrote: > I've spent a few hours today reading about Galera, a clustering solution > for MySQL. Galera provides multi-master 'virtually synchronous' > replication between multiple mysql nodes. i.e. I can create a cluster of > 3 mysql dbs and read and write from any of them with certain consistency > guarantees. > > I am no expert[1], but this is a TL;DR of a couple of things which I > didn't know, but feel I should have done. The semantics are important to > application design, which is why we should all be aware of them. > > > * Commit will fail if there is a replication conflict > > foo is a table with a single field, which is its primary key. > > A: start transaction; > B: start transaction; > A: insert into foo values(1); > B: insert into foo values(1); <-- 'regular' DB would block here, and > report an error on A's commit > A: commit; <-- success > B: commit; <-- KABOOM > > Confusingly, Galera will report a 'deadlock' to node B, despite this not > being a deadlock by any definition I'm familiar with.
Yes ! and if I can add more information and I hope I do not make mistake I think it's a know issue which comes from MySQL, that is why we have a decorator to do a retry and so handle this case here: http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py#n177 > Essentially, anywhere that a regular DB would block, Galera will not > block transactions on different nodes. Instead, it will cause one of the > transactions to fail on commit. This is still ACID, but the semantics > are quite different. > > The impact of this is that code which makes correct use of locking may > still fail with a 'deadlock'. The solution to this is to either fail the > entire operation, or to re-execute the transaction and all its > associated code in the expectation that it won't fail next time. > > As I understand it, these can be eliminated by sending all writes to a > single node, although that obviously makes less efficient use of your > cluster. > > > * Write followed by read on a different node can return stale data > > During a commit, Galera replicates a transaction out to all other db > nodes. Due to its design, Galera knows these transactions will be > successfully committed to the remote node eventually[2], but it doesn't > commit them straight away. The remote node will check these outstanding > replication transactions for write conflicts on commit, but not for > read. This means that you can do: > > A: start transaction; > A: insert into foo values(1) > A: commit; > B: select * from foo; <-- May not contain the value we inserted above[3] > > This means that even for 'synchronous' slaves, if a client makes an RPC > call which writes a row to write master A, then another RPC call which > expects to read that row from synchronous slave node B, there's no > default guarantee that it'll be there. > > Galera exposes a session variable which will fix this: wsrep_sync_wait > (or wsrep_causal_reads on older mysql). However, this isn't the default. > It presumably has a performance cost, but I don't know what it is, or > how it scales with various workloads. > > > Because these are semantic issues, they aren't things which can be > easily guarded with an if statement. We can't say: > > if galera: > try: > commit > except: > rewind time > > If we are to support this DB at all, we have to structure code in the > first place to allow for its semantics. > > Matt > > [1] No, really: I just read a bunch of docs and blogs today. If anybody > who is an expert would like to validate/correct that would be great. > > [2] > http://www.percona.com/blog/2012/11/20/understanding-multi-node-writing-conflict-metrics-in-percona-xtradb-cluster-and-galera/ > > [3] > http://www.percona.com/blog/2013/03/03/investigating-replication-latency-in-percona-xtradb-cluster/ > -- > Matthew Booth > Red Hat Engineering, Virtualisation Team > > Phone: +442070094448 (UK) > GPG ID: D33C3490 > GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490 > > __________________________________________________________________________ > OpenStack Development Mailing List (not for usage questions) > Unsubscribe: [email protected]?subject:unsubscribe > http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev __________________________________________________________________________ OpenStack Development Mailing List (not for usage questions) Unsubscribe: [email protected]?subject:unsubscribe http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
